uncoverml package¶
Subpackages¶
- uncoverml.optimise package
- uncoverml.scripts package
- Submodules
- uncoverml.scripts.cluster_cli module
- uncoverml.scripts.covdiag_cli module
- uncoverml.scripts.gammasensor_cli module
- uncoverml.scripts.gridsearch_cli module
- uncoverml.scripts.learn_cli module
- uncoverml.scripts.modelfix_cli module
- uncoverml.scripts.predict_cli module
- uncoverml.scripts.resample_cli module
- uncoverml.scripts.shiftmap_cli module
- uncoverml.scripts.subsample_cli module
- uncoverml.scripts.targetsearch_cli module
- uncoverml.scripts.tiff2kmz_cli module
- Module contents
- uncoverml.transforms package
Submodules¶
uncoverml.cluster module¶
-
class
uncoverml.cluster.
KMeans
(k, oversample_factor)¶ Bases:
object
Model object implementing learn and predict with K-means
- Parameters
k (int > 0) – The number of classes to cluster the data into
oversample_factor (int > 1) – Controls the number of samples drawn as part of [1] in the initialisation step. More MPI nodes will increase the total number of points. Consider values of 1 for more than about 16 nodes
References
- 1
Bahmani, Bahman, Benjamin Moseley, Andrea
Vattani, Ravi Kumar, and Sergei Vassilvitskii. “Scalable k-means++.” Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633.
-
learn
(x, indices=None, classes=None)¶ Find the cluster centres using k-means||
- Parameters
x (ndarray) – (n_samples, n_dimensions) length array containing the training samples to cluster
indices (ndarray) – (n_samples) length integer array giving the locations in x where labels exist
classes (ndarray) – (n_samples) length integer array giving the class assignments of points in x in locations given by indices
-
predict
(x, *args, **kwargs)¶
-
class
uncoverml.cluster.
TrainingData
(indices, classes)¶ Bases:
object
Light wrapper for the indices and values of training data
- Parameters
indices (ndarray) – length N array of the indices of the input data that have classes assigned
classes (ndarray) – length N int array of the class values at locations specified by indices
-
uncoverml.cluster.
centroid
(X, weights=None)¶ Compute the centroid of a set of points X
The points X may have repetitions given by the weights.
- Parameters
X (ndarray) – (n, d) array of n d-dimensional points
weights (ndarray (optional)) – (n,) array of weights giving the repetition (or mass?) of each X
- Returns
centroid – (d,) length array, the d-dimensional centroid point of all x in X.
- Return type
ndarray
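The weighted centroid described above can be sketched with NumPy as follows. This is a minimal illustration, not the library's implementation: the name `weighted_centroid` is hypothetical, and the weights are assumed to act as repetition counts.

```python
import numpy as np

def weighted_centroid(X, weights=None):
    """Hypothetical sketch: (weighted) mean of the rows of X."""
    X = np.asarray(X, dtype=float)
    if weights is None:
        return X.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    # Each point counts w[i] times; dividing by the total weight normalises.
    return (w[:, None] * X).sum(axis=0) / w.sum()

points = np.array([[0.0, 0.0], [2.0, 4.0]])
c_plain = weighted_centroid(points)
c_weighted = weighted_centroid(points, weights=[1, 3])
```

With weights `[1, 3]` the second point is treated as if it appeared three times, so the centroid moves towards it.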
-
uncoverml.cluster.
compute_class
(X, C, training_data=None)¶ Find the closest cluster centre for each x in X
This returns which cluster centre each X belongs to, with optional semi-supervised training data that will force an assignment of a point to a particular class
- Parameters
X (ndarray) – (n, d) array of n d-dimensional points to be evaluated
C (ndarray) – (k, d) array of cluster centres, associated with classes 0..k-1
training_data (TrainingData (optional)) – instance of TrainingData containing fixed class assignments for particular points
- Returns
classes (ndarray) – (n,) int array of class assignments (0..k-1) for each x in X
cost (float) – The total ‘cost’ of the assignment, which is the average distance of all points to their assigned centre
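A rough sketch of nearest-centre assignment with optional forced labels, under the assumption that the cost is the mean Euclidean distance to the assigned centre. The function name `assign_classes` and its `fixed_indices` / `fixed_classes` arguments are hypothetical stand-ins for the TrainingData object.

```python
import numpy as np

def assign_classes(X, C, fixed_indices=None, fixed_classes=None):
    """Hypothetical sketch of nearest-centre assignment with forced labels."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    # (n, k) matrix of squared Euclidean distances between points and centres.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    classes = d2.argmin(axis=1)
    if fixed_indices is not None:
        # Semi-supervised override: labelled points keep their given class.
        classes[fixed_indices] = fixed_classes
    # Average distance of each point to its assigned centre.
    cost = np.sqrt(d2[np.arange(len(X)), classes]).mean()
    return classes, cost

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
C = np.array([[0.0, 0.0], [10.0, 10.0]])
classes, cost = assign_classes(X, C)
forced, _ = assign_classes(X, C, fixed_indices=[1], fixed_classes=[1])
```

In the second call the point at index 1 is forced into class 1 even though centre 0 is closer.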
-
uncoverml.cluster.
compute_n_classes
(classes, config)¶ The number of cluster centres to use for K-means
Just handles the case where someone specifies k=5 but labels 10 classes in the training data. This will return k=10.
- Parameters
classes (ndarray) – an array of hard class assignments given as training data
config (Config) – The app config class holding the number of classes asked for
- Returns
k – The max of k and the number of classes referenced in the training data
- Return type
int > 0
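The rule above amounts to taking a maximum. The sketch below assumes that "the number of classes referenced" means the count of distinct labels; `resolve_k` is a hypothetical name and takes the requested k directly rather than a Config object.

```python
def resolve_k(requested_k, training_classes):
    """Hypothetical sketch: never use fewer clusters than labelled classes."""
    n_labelled = len(set(training_classes))
    return max(requested_k, n_labelled)

k_small = resolve_k(5, list(range(10)))  # 10 labelled classes, k=5 requested
k_large = resolve_k(12, [0, 1, 1, 2])    # 3 labelled classes, k=12 requested
```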
-
uncoverml.cluster.
compute_weights
(x, C)¶ Number of points in x assigned to each centre c in C
- Parameters
x (ndarray) – (n, d) array of n d-dimensional points
C (ndarray) – (k, d) array of k cluster centres
- Returns
weights – (k,) length array giving number of x closest to each c in C
- Return type
ndarray
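Counting the points nearest to each centre can be sketched with `np.bincount`. This is an illustrative single-process version; `count_assignments` is a hypothetical name.

```python
import numpy as np

def count_assignments(x, C):
    """Hypothetical sketch: number of points in x closest to each centre."""
    x = np.asarray(x, dtype=float)
    C = np.asarray(C, dtype=float)
    d2 = ((x[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # minlength keeps a zero entry for centres that attract no points.
    return np.bincount(nearest, minlength=len(C))

x = np.array([[0.0], [1.0], [9.0]])
C = np.array([[0.0], [10.0], [100.0]])
weights = count_assignments(x, C)
```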
-
uncoverml.cluster.
initialise_centres
(X, k, l, training_data=None, max_iterations=1000)¶ Use Kmeans|| to find initial cluster centres
This algorithm efficiently generates log(n) candidate samples, then uses k-means to cluster them into the k initial starting centres used in the main algorithm (clustering X)
- Parameters
X (ndarray) – (n,d) array of points to cluster
k (int > 0) – number of clusters
l (float > 0) – Oversample factor. See weighted_starting_candidates.
training_data (TrainingData (optional)) – Optional hard assignments of certain points in X
max_iterations (int > 0) – The algorithm will terminate after this many iterations even if it hasn’t converged.
- Returns
C_init – (k, d) array of starting cluster centres for clustering X with k-means.
- Return type
ndarray
-
uncoverml.cluster.
kmean_distance2
(x, C)¶ Compute squared Euclidean distance to the nearest cluster centre
- Parameters
x (ndarray) – (n, d) array of n d-dimensional points
C (ndarray) – (k, d) array of k cluster centres
- Returns
d2_x – (n,) length array of distances from each x to the nearest centre
- Return type
ndarray
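A minimal sketch of the squared distance to the nearest centre, computed in chunks so that the intermediate distance matrix stays small. The name `nearest_centre_distance2` and the `chunk_size` parameter are assumptions made for illustration.

```python
import numpy as np

def nearest_centre_distance2(x, C, chunk_size=1000):
    """Hypothetical sketch: squared distance to the nearest centre."""
    x = np.asarray(x, dtype=float)
    C = np.asarray(C, dtype=float)
    out = np.empty(len(x))
    # Process points in chunks so the (chunk, k) distance matrix stays small.
    for start in range(0, len(x), chunk_size):
        block = x[start:start + chunk_size]
        d2 = ((block[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        out[start:start + chunk_size] = d2.min(axis=1)
    return out

x = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 1.0]])
d2_x = nearest_centre_distance2(x, C, chunk_size=2)
```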
-
uncoverml.cluster.
kmeans_step
(X, C, classes, weights=None)¶ A single step of the k-means algorithm.
Assigns every point in X a centre, then computes the centroid of all x assigned to each centre, then updates that centre to be the new centroid.
- Parameters
X (ndarray) – (n, d) array of points to be clustered
C (ndarray) – (k, d) array of initial cluster centres
classes (ndarray) – (n,) array of initial class assignments
weights (ndarray (optional)) – weights for points x in X that allow for different ‘masses’ or repetitions in the centroid calculation
- Returns
C_new – (k, d) array of new cluster centres
- Return type
ndarray
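The centre update in a single k-means step can be sketched as below. This version takes the class assignments as given and leaves empty clusters unchanged, which is an assumption of the sketch; `update_centres` is a hypothetical name.

```python
import numpy as np

def update_centres(X, C, classes, weights=None):
    """Hypothetical sketch: move each centre to the centroid of its members."""
    X = np.asarray(X, dtype=float)
    classes = np.asarray(classes)
    C_new = np.array(C, dtype=float)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    for j in range(len(C_new)):
        members = classes == j
        if members.any():
            # Weighted centroid of the points currently assigned to centre j.
            C_new[j] = (w[members, None] * X[members]).sum(axis=0) / w[members].sum()
        # Empty clusters keep their old centre in this sketch.
    return C_new

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
C = np.array([[1.0, 1.0], [9.0, 9.0]])
C_new = update_centres(X, C, classes=[0, 0, 1])
C_weighted = update_centres(X, C, classes=[0, 0, 1], weights=[3, 1, 1])
```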
-
uncoverml.cluster.
log
= <Logger uncoverml.cluster (WARNING)>¶ Never use more than this many x’s to compute a distance matrix (save memory!)
-
uncoverml.cluster.
reseed_point
(X, C, index)¶ Re-initialise the centre of a class if it loses all its members
This should almost never happen. If it does, find the point furthest from all the other cluster centres and use that. Maybe a bad idea but a decent first pass
- Parameters
X (ndarray) – (n, d) array of points
C (ndarray) – (k, d) array of cluster centres
index (int >= 0) – index between 0..k-1 of the cluster that has lost its points
- Returns
new_point – d-dimensional point for replacing the empty cluster centre.
- Return type
ndarray
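One way to read "furthest from all the other cluster centres" is the point whose distance to its nearest remaining centre is largest. The sketch below uses that interpretation as an assumption; `furthest_point` is a hypothetical name.

```python
import numpy as np

def furthest_point(X, C, index):
    """Hypothetical sketch: pick a replacement for the empty centre C[index]."""
    X = np.asarray(X, dtype=float)
    # Ignore the empty cluster's own (stale) centre.
    others = np.delete(np.asarray(C, dtype=float), index, axis=0)
    d2 = ((X[:, None, :] - others[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return X[d2.argmax()]

X = np.array([[0.0, 0.0], [5.0, 5.0], [20.0, 20.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0], [99.0, 99.0]])
new_point = furthest_point(X, C, index=2)
```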
-
uncoverml.cluster.
run_kmeans
(X, C, k, weights=None, training_data=None, max_iterations=1000)¶ Cluster points into k clusters using K-means
This is a distributed implementation of Lloyd’s algorithm, which iteratively refines the assignment of points and cluster centres until it reaches a local optimum. It depends heavily on the initial cluster centres C
- Parameters
X (ndarray) – (n, d) array of n d-dimensional points to cluster
C (ndarray) – (k, d) array of initial cluster centres
k (int > 0) – number of clusters
weights (ndarray (optional)) – (n,) array of optional repetition weights for points in X. A weight of 2 implies there are 2 points at that location
training_data (TrainingData (optional)) – An instance of the TrainingData class containing fixed cluster assignments for some of the x in X
max_iterations (int > 0 (optional)) – The algorithm will return after this many iterations, even if it hasn’t converged
- Returns
C (ndarray) – (k, d) array of final cluster centres, ordered (0..k-1)
classes (ndarray) – (n,) array of class assignments (0..k-1) for each x in X
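For orientation, a single-process sketch of the standard k-means iteration is given below. It leaves out the distribution across nodes, weights and training data, and it stops when the assignments no longer change; these simplifications and the name `simple_kmeans` are assumptions of the sketch.

```python
import numpy as np

def simple_kmeans(X, C, max_iterations=1000):
    """Hypothetical single-process sketch of the k-means iteration."""
    X = np.asarray(X, dtype=float)
    C = np.array(C, dtype=float)
    classes = None
    for _ in range(max_iterations):
        # Assignment step: nearest centre for every point.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_classes = d2.argmin(axis=1)
        if classes is not None and np.array_equal(new_classes, classes):
            break  # assignments stopped changing: converged
        classes = new_classes
        # Update step: move each centre to the mean of its members.
        for j in range(len(C)):
            members = classes == j
            if members.any():
                C[j] = X[members].mean(axis=0)
    return C, classes

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
C_final, classes = simple_kmeans(X, C=[[0.0, 1.0], [5.0, 5.0]])
```

Because the result depends on the starting centres, a different initial `C` can converge to a different local optimum.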
-
uncoverml.cluster.
sum_axis_0
(x, y, dtype)¶ Reduce operation that sums 2 arrays on axis zero
-
uncoverml.cluster.
weighted_starting_candidates
(X, k, l)¶ Generate (weighted) candidates to initialise the full k-means
See the kmeans|| algorithm/paper for details. The goal is to find points that are good starting cluster centres for a full kmeans using only log(n) passes through the data
- Parameters
X (ndarray) – (n, d) array of n d-dimensional points to be clustered
k (int > 0) – number of clusters
l (float > 0) – The ‘oversample factor’ that controls how many candidates are found. Candidates are found independently on each node so this can be smaller with a bigger computation.
- Returns
w (ndarray) – The ‘weights’ of the cluster centres, which are the number of points in X closest to each centre
C (ndarray) – The cluster centres themselves. The total candidates is not known beforehand so the array will be shaped (z, d) where z is some number that increases with l.
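The candidate generation can be sketched roughly as follows, loosely following the k-means|| idea of oversampling points in proportion to their squared distance from the current candidates. This is a single-process illustration only: the fixed `n_rounds`, the `seed` argument and the name `kmeans_parallel_candidates` are assumptions, not the library's behaviour.

```python
import numpy as np

def kmeans_parallel_candidates(X, l, n_rounds=5, seed=0):
    """Hypothetical sketch of k-means||-style candidate sampling."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from a single point chosen uniformly at random.
    C = X[rng.integers(len(X))][None, :]
    for _ in range(n_rounds):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        if total == 0:
            break  # every point already coincides with a candidate
        # Keep each point independently with probability proportional to its
        # squared distance from the current candidates, scaled by l.
        keep = rng.random(len(X)) < np.minimum(1.0, l * d2 / total)
        C = np.vstack([C, X[keep]])
    # Weight of a candidate = number of points closest to it.
    d2_all = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    w = np.bincount(d2_all.argmin(axis=1), minlength=len(C))
    return w, C

X = np.random.default_rng(1).normal(size=(200, 2))
w, C = kmeans_parallel_candidates(X, l=4.0)
```

The number of candidates is not known in advance; a larger `l` tends to produce more of them.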
uncoverml.config module¶
Handles parsing of the configuration file.
-
class
uncoverml.config.
Config
(yaml_file, clustering=False, learning=False, resampling=False, predicting=False, shiftmap=True)¶ Bases:
object
Class representing the global configuration of the uncoverml scripts.
This class is mostly read-only, but it does also contain the Transform objects which have state. In some execution paths, config flags are switched off then back on (e.g. in cross validation).
Along with the YAML file, the init also takes some flags. These are set by the top-level CLI scripts and are used to determine what parameters to load and what can be ignored.
All attributes following output_dir (located at the bottom of init) are undocumented but should be self-explanatory. They are full paths to output for different features.
Todo
Factor out stateful Transform objects.
- Parameters
yaml_file (str) – The path to the yaml config file.
clustering (bool) – True if clustering.
learning (bool) – True if learning.
resampling (bool) – True if resampling.
predicting (bool) – True if predicting.
-
name
¶ Name of the config file.
- Type
str
-
algorithm_args
¶ A dictionary of arguments to pass to selected model. See Models for available arguments to model. Key is the argument name exactly as it appears in model __init__ (this dict gets passed as kwargs).
- Type
dict(str, any)
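Because the dictionary is passed as keyword arguments, its keys must match the model's `__init__` parameter names exactly. The sketch below illustrates this with a hypothetical `ExampleModel` class and made-up argument names; neither is part of uncoverml.

```python
class ExampleModel:
    """Hypothetical stand-in for a model class; not part of uncoverml."""

    def __init__(self, n_estimators=10, max_depth=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth

# Keys mirror the __init__ parameter names, so ** unpacking works directly.
algorithm_args = {"n_estimators": 50, "max_depth": 4}
model = ExampleModel(**algorithm_args)
```

A key that does not match a parameter name would raise a `TypeError` when the model is constructed.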
-
cubist
¶ True if cubist algorithm is being used.
- Type
bool
-
multicubist
¶ True if multicubist algorithm is being used.
- Type
bool
-
multirandomforest
¶ True if multirandomforest algorithm is being used.
- Type
bool
-
krige
¶ True if kriging is being used.
- Type
bool
-
bootstrap
¶ True if a bootstrapped algorithm is being used.
- Type
bool
-
clustering
¶ True if clustering is being performed.
- Type
bool
-
n_classes
¶ Number of classes to cluster into. Required if clustering.
- Type
int
-
oversample_factor
¶ Controls how many candidates are found for cluster initialisation when running kmeans clustering. See weighted_starting_candidates(). Required when clustering.
- Type
float
-
cluster_analysis
¶ True if analysis should be performed post-clustering. Optional, default is False.
- Type
bool, optional
-
class_file
¶ Define classes for clustering feature data. Path to shapefile that defines class at positions.
- Type
str or bytes, optional
-
semi_supervised
¶ True if semi_supervised clustering is being performed (i.e. class_file has been provided).
- Type
bool
-
target_search
¶ True if target_search feature is being used.
- Type
bool
-
target_search_threshold
¶ Target search threshold, float between 0 and 1. The likelihood a training point must surpass to be included in found points.
- Type
float
-
target_search_extents
¶ A bounding box defining the image area to search for additional targets.
- Type
tuple(float, float, float, float)
-
tse_are_pixel_coordinates
¶ If True, target_search_extents are treated as pixel coordinates instead of CRS coordinates.
- Type
bool
-
extents
¶ A bounding box defining the area to learn and predict on. Data outside these extents gets cropped. Optional, if not provided whole image area is used.
- Type
tuple(float, float, float, float), optional
-
extents_are_pixel_coordinates
¶ If True, extents are treated as pixel coordinates instead of CRS coordinates.
- Type
bool
-
pk_covariates
¶ Path to where to save pickled covariates, or a pre-existing covariate pickle file if loading pickled covariates.
- Type
str or bytes
-
pk_targets
¶ Path to where to save pickled targets, or a pre-existing target pickle file if loading pickled targets.
- Type
str or bytes
-
pk_load
¶ True if both pk_covariates and pk_targets are provided and these paths exist (it’s assumed they contain the correct pickled data).
- Type
bool
-
feature_sets
¶ The provided features as FeatureSetConfig objects. These contain paths to the feature files and importantly the Transform objects which contain statistics used to transform the covariates. These Transform objects and contained statistics must be maintained across workflow steps (aka CLI commands).
- Type
-
patchsize
¶ Half-width of the patches that feature data will be chunked into. Height/width of each patch is equal to patchsize * 2 + 1.
Todo
Not implemented, defaults to 1.
- Type
int
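The relationship between the half-width and the full patch size can be illustrated with plain slicing. `patch_bounds` is a hypothetical helper used only for this example.

```python
def patch_bounds(centre, patchsize):
    """Hypothetical helper: slice bounds of a patch along one axis."""
    return centre - patchsize, centre + patchsize + 1

lo, hi = patch_bounds(centre=5, patchsize=1)
row = list(range(10))
patch = row[lo:hi]  # patchsize * 2 + 1 = 3 pixels wide
```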
-
target_file
¶ Path to a shapefile defining the targets to be trained on.
- Type
str or bytes
-
target_property
¶ Name of the field in the target_file to be used as training property.
- Type
str
-
target_weight_property
¶ Name of the field in the target_file to be used as target weights.
- Type
str, optional
-
fields_to_write_to_csv
¶ List of field names in the target_file to be included in output table.
- Type
list(str), optional
-
shiftmap_targets
¶ Path to a shapefile containing targets to generate shiftmap from. This is optional, by default shiftmap will generate dummy targets by randomly sampling the target shapefile.
- Type
str or bytes, optional
-
spatial_resampling_args
¶ Kwargs for spatial resampling. See Resampling for more details.
- Type
dict
-
value_resampling_args
¶ Kwargs for value resampling. See Resampling for more details.
- Type
dict
-
final_transform
¶ Transforms to apply to whole image set after other preprocessing has been performed.
- Type
TransformSet
-
oos_percentage
¶ Float between 0 and 1. The percentage of targets to withhold from training to be used in out-of-sample validation.
- Type
float, optional
-
oos_shapefile
¶ Shapefile containing targets to be used in out-of-sample validation.
- Type
str or bytes, optional
-
oos_property
¶ Name of the property in oos_shapefile to be used in validation. Only required if an OOS shapefile is provided.
- Type
str
-
out_of_sample_validation
¶ True if out of sample validation is to be performed.
- Type
bool
-
rank_features
¶ True if ‘feature_ranking’ is True in ‘validation’ block of the config. Turns on feature ranking. Default is False.
- Type
bool, optional
-
permutation_importance
¶ True if ‘permutation_importance’ is True in ‘validation’ block of the config. Turns on permutation importance. Default is False.
- Type
bool
-
parallel_validate
¶ True if ‘parallel’ is present in ‘k-fold’ block of config. Turns on parallel k-fold cross validation. Default is False.
- Type
bool, optional
-
cross_validate
¶ True if ‘k-fold’ block is present in ‘validation’ block of config. Turns on k-fold cross validation.
- Type
bool, optional
-
folds
¶ The number of folds to split dataset into for cross validation. Required if cross_validate is True.
- Type
int
-
crossval_seed
¶ Seed for random sorting of folds for cross validation. Required if cross_validate is True.
- Type
int
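How a fold count and a seed might combine into a reproducible split can be sketched as below. This is a generic k-fold assignment, not necessarily how uncoverml splits targets; `fold_assignment` is a hypothetical name.

```python
import numpy as np

def fold_assignment(n_targets, folds, seed):
    """Hypothetical sketch: assign each target to a fold, reproducibly."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_targets)
    assignment = np.empty(n_targets, dtype=int)
    # Deal shuffled targets into folds round-robin so fold sizes stay balanced.
    assignment[permutation] = np.arange(n_targets) % folds
    return assignment

first = fold_assignment(n_targets=10, folds=3, seed=1)
second = fold_assignment(n_targets=10, folds=3, seed=1)
```

Using the same seed yields the same split, which keeps cross-validation results comparable between runs.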
-
optimisation
¶ Dictionary of optimisation arguments. See Optimisation for details.
- Type
dict
-
geotiff_options
¶ Optional creation options passed to the geotiff output driver. See https://gdal.org/drivers/raster/gtiff.html#creation-options for a list of creation options.
- Type
dict, optional
-
quantiles
¶ Prediction quantile/interval for predicted values.
- Type
float
-
outbands
¶ The outbands to write in the prediction output file. Used as the ‘stop’ for a slice taken from the list of prediction tags, i.e. [0: outbands]. If the slice stop is greater than the number of tags available, then all tags will be selected. If no value is provided, then all tags will be selected.
Todo
Having this as a slice is questionable. Should be simplified.
- Type
int
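The slice semantics described for outbands follow ordinary Python slicing, as the short sketch below shows. The tag names are made up for illustration, and `select_tags` is a hypothetical helper.

```python
def select_tags(tags, outbands=None):
    """Hypothetical sketch: outbands is the 'stop' of a slice over the tags."""
    return tags[0:outbands]

tags = ["Prediction", "Variance", "Lower quantile", "Upper quantile"]
first_two = select_tags(tags, outbands=2)
too_many = select_tags(tags, outbands=10)  # stop beyond the end: all tags
default = select_tags(tags)                # None as stop: all tags
```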
-
thumbnails
¶ Subsampling factor for thumbnails of output images. Default is 10.
- Type
int, optional
-
bootstrap_predictions
¶ Only applies if a bootstrapped algorithm is being used. This is the number of predictions to perform, by default will predict on all sub-models. E.g. if you had a BS algorithm containing 100 sub-models, you could limit a test prediction to 20 using this parameter to speed things up.
- Type
int, optional
-
mask
¶ Path to a geotiff file for masking the output prediction map. Only values that have been masked will be predicted.
- Type
str, optional
-
retain
¶ Value in the above mask that indicates cell should be retained and predicted. Must be provided if a mask is provided.
- Type
int
-
lon_lat
¶ Dictionary containing paths to longitude and latitude grids used in kriging.
- Type
dict, optional
-
output_dir
¶ Path to directory where prediction map and other outputs will be written.
- Type
str
-
static
parse_extents
(exb)¶ Validates extents parameters.
-
set_algo_flags
()¶ Convenience method for setting boolean flags based on the algorithm being used.
-
property
tmpdir
¶ Convenience method for creating tmpdir needed by some UncoverML functionality.
-
yaml_loader
¶ alias of
yaml.loader.SafeLoader
-
exception
uncoverml.config.
ConfigException
¶ Bases:
Exception
-
class
uncoverml.config.
FeatureSetConfig
(config_dict)¶ Bases:
object
Config class representing a ‘feature set’ in the config file.
- Parameters
config_dict (
dict
) – The section of the yaml file for a feature set.
-
name
¶ Name of the feature set.
- Type
str
-
type
¶ Data type of the feature set (‘categorical’ or ‘ordinal’).
- Type
str
-
files
¶ Absolute paths to .tif files of the feature set.
- Type
list of str
-
transform_set
¶ Transforms specified for the feature set.
- Type
uncoverml.cubist module¶
-
class
uncoverml.cubist.
Cubist
(name='temp', print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, sampling=None, seed=None, neighbors=None, feature_type=None, composite_model=False, auto=False, extrapolation=None, calc_usage=False, bootstrap=None)¶ Bases:
object
This class wraps the cubist command line tools in a scikit-learn interface. The learning phase relies on the cubist command line tools, whereas the predictions themselves are executed directly in python.
-
fit
(x, y)¶ Train the Cubist model. Given a matrix of values (x) and an output vector of values (y), this method will train the cubist model and then read the training files directly as parameters of this class.
- Parameters
x (numpy.array) – x contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, and n is the number of available training points.
y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.
-
predict
(x)¶ Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.
- Parameters
x (numpy.array) – The inputs for which the model should be evaluated
- Returns
y_mean – An array of expected output values given the inputs
- Return type
numpy.array
-
predict_dist
(x, interval=0.95)¶ Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in x. It also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.
This method also outputs quantile information along with the variance to establish the probability distribution clearly.
- Parameters
x (numpy.array) – The inputs for which the model should be evaluated
interval (float) – The probability threshold for which the quantiles should be output.
- Returns
y_mean (numpy.array) – An array of expected output values given the inputs
y_var (numpy.array) – The variance of the outputs
ql (numpy.array) – The lower quantiles for each input
qu (numpy.array) – The upper quantiles for each input
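How a mean, a variance and an interval could map to lower and upper quantiles is sketched below. It assumes a Gaussian predictive distribution, which the text above does not state, and uses only the standard library; `gaussian_interval` is a hypothetical name.

```python
from statistics import NormalDist

def gaussian_interval(mean, var, interval=0.95):
    """Hypothetical sketch: central interval of a Gaussian prediction."""
    tail = (1.0 - interval) / 2.0
    dist = NormalDist(mu=mean, sigma=var ** 0.5)
    # Lower and upper quantiles enclosing `interval` of the probability mass.
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

ql, qu = gaussian_interval(mean=10.0, var=4.0)
```

For a 95% interval the bounds sit about 1.96 standard deviations either side of the mean.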
-
-
class
uncoverml.cubist.
CubistReportRow
(cond, model, feature)¶ Bases:
object
convenience class for accumulating cubist report
-
class
uncoverml.cubist.
MultiCubist
(outdir='.', trees=10, print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, neighbors=None, feature_type=None, sampling=70, seed=None, extrapolation=None, composite_model=False, auto=False, parallel=False, calc_usage=False, bootstrap=None)¶ Bases:
object
This is a wrapper on Cubist.
-
calculate_usage
()¶ Averages the Cond and Model statistics of all the cubist runs
-
fit
(x, y)¶ Train the Cubist model. Given a matrix of values (x) and an output vector of values (y), this method will train the cubist model and then read the training files directly as parameters of this class.
- Parameters
x (numpy.array) – x contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, and n is the number of available training points.
y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.
-
predict
(x)¶ Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.
- Parameters
x (numpy.array) – The inputs for which the model should be evaluated
- Returns
y_mean – An array of expected output values given the inputs
- Return type
numpy.array
-
predict_dist
(x, interval=0.95)¶ Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in x. It also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.
This method also outputs quantile information along with the variance to establish the probability distribution clearly.
- Parameters
x (numpy.array) – The inputs for which the model should be evaluated
interval (float) – The probability threshold for which the quantiles should be output.
- Returns
y_mean (numpy.array) – An array of expected output values given the inputs
y_var (numpy.array) – The variance of the outputs
ql (numpy.array) – The lower quantiles for each input
qu (numpy.array) – The upper quantiles for each input
-
-
class
uncoverml.cubist.
Rule
(rule, m)¶ Bases:
object
-
comparator
= {'<': <ufunc 'less'>, '<=': <ufunc 'less_equal'>, '=': <ufunc 'equal'>, '>': <ufunc 'greater'>, '>=': <ufunc 'greater_equal'>}¶
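A mapping from operator strings to NumPy ufuncs like the one above lets a rule condition be evaluated on a whole column at once. The sketch below rebuilds the same mapping and applies it; `condition_mask` is a hypothetical helper, not a method of Rule.

```python
import numpy as np

comparator = {
    "<": np.less,
    "<=": np.less_equal,
    "=": np.equal,
    ">": np.greater,
    ">=": np.greater_equal,
}

def condition_mask(column, operator, threshold):
    """Hypothetical helper: boolean mask of rows satisfying one condition."""
    return comparator[operator](column, threshold)

values = np.array([1.0, 2.5, 4.0])
mask = condition_mask(values, "<=", 2.5)
```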
-
regress
(x, mask=None)¶
-
satisfied
(x)¶
-
-
uncoverml.cubist.
arguments
(p)¶
-
uncoverml.cubist.
cond_line
(line)¶
-
uncoverml.cubist.
mean
(numbers)¶
-
uncoverml.cubist.
pairwise
(iterable)¶
-
uncoverml.cubist.
parse_float_array
(arraystring)¶
-
uncoverml.cubist.
read_data
(filename)¶
-
uncoverml.cubist.
remove_first_line
(line)¶
-
uncoverml.cubist.
save_data
(filename, data)¶
-
uncoverml.cubist.
variance_with_mean
(mean)¶
-
uncoverml.cubist.
write_dict
(filename, dict_obj)¶
uncoverml.cubist_config module¶
uncoverml.diagnostics module¶
This module contains functionality for plotting validation scores and other diagnostic information.
-
uncoverml.diagnostics.
plot_covariate_correlation
(path, method='pearson')¶ Plots matrix of correlation between covariates.
- Parameters
path (str) – Path to ‘rawcovariates’ CSV file.
method (str, optional) – Correlation coefficient to calculate. Choices are ‘pearson’, ‘kendall’, ‘spearman’. Default is ‘pearson’.
- Returns
The matrix plot as a matplotlib Figure.
- Return type
obj:matplotlib.figure.Figure
-
uncoverml.diagnostics.
plot_covariates_x_targets
(path, cols=2, subplot_width=8, subplot_height=4)¶ Plots scatter plots of each covariate intersected with target values.
- Parameters
path (str) – Path to ‘rawcovariates’ CSV file containing intersection of targets and covariates.
cols (int, optional) – The number of columns to split the figure into. Default is 2.
subplot_width (int) – Width of each subplot in inches. Default is 8.
subplot_height (int) – Height of each subplot in inches. Default is 4.
- Returns
The scatter plots as a matplotlib Figure.
- Return type
obj:matplotlib.figure.Figure
-
uncoverml.diagnostics.
plot_feature_rank_curves
(path, subplot_width=8, subplot_height=4)¶ Plots curves for feature ranking of each metric.
- Parameters
path (str) – Path to ‘featureranks’ JSON file.
subplot_width (int, optional) – Width of each subplot. Default is 8.
subplot_height (int, optional) – Height of each subplot. Default is 4.
- Returns
The plots as a matplotlib Figure.
- Return type
obj:matplotlib.figure.Figure
-
uncoverml.diagnostics.
plot_feature_ranks
(path, barwidth=0.08, figsize=(15, 9))¶ Plots a grouped bar chart of feature rank scores, grouped by covariate. Depending on the number of covariates and metrics being calculated you may need to tweak barwidth and figsize so the bars fit.
- Parameters
path (str) – Path to JSON file containing feature ranking results.
barwidth (float, optional) – Width of the bars.
figsize (tuple(float, float), optional) – The (width, height) of the figure in inches.
- Returns
The bar chart as a matplotlib Figure.
- Return type
obj:matplotlib.figure.Figure
-
uncoverml.diagnostics.
plot_real_vs_pred_crossval
(crossval_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)¶
-
uncoverml.diagnostics.
plot_real_vs_pred_prediction
(rc_path, pred_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)¶
-
uncoverml.diagnostics.
plot_residual_error_crossval
(crossval_path, bins=20)¶
-
uncoverml.diagnostics.
plot_residual_error_prediction
(rc_path, pred_path, bins=20)¶
-
uncoverml.diagnostics.
plot_target_scaling
(path, bins=20, title='Target Scaling', sharey=False)¶ Plots histograms of target values pre and post-scaling.
- Parameters
path (str) – Path to ‘transformed_targets’ CSV file.
bins (int, optional) – The number of value bins for the histograms. Default is 20.
title (str, optional) – The title of the plot. Defaults to ‘Target Scaling’.
sharey (bool) – Whether the plots will share a y-axis and scale. Default is False.
- Returns
The histograms as a matplotlib Figure.
- Return type
obj:matplotlib.figure.Figure
uncoverml.features module¶
-
uncoverml.features.
cull_all_null_rows
(feature_sets)¶
-
uncoverml.features.
extract_features
(image_source, targets, n_subchunks, patchsize)¶ Each node gets its own share of the targets, so all nodes will always have targets.
-
uncoverml.features.
extract_subchunks
(image_source, subchunk_index, n_subchunks, patchsize)¶
-
uncoverml.features.
features_from_shapefile
(feature_sets, mask=None)¶
-
uncoverml.features.
gather_features
(x, node=None)¶
-
uncoverml.features.
intersect_shapefile_features
(targets, feature_sets, target_drop_values)¶ Extract covariates from a shapefile. This is done by intersecting targets with the shapefile. The shapefile must have the same number of rows as there are targets.
Drop target values here for tabular predictions. This is mainly for convenience if there are classes or points in the target file that we don’t want to predict on for whatever reason (e.g. out-of-sample validation purposes). It’s done here rather than when targets are first loaded so we don’t also have to handle a mask + targets being returned from target loading as the mask won’t be required in most situations.
- Parameters
targets (uncoverml.targets.Targets) – An uncoverml.targets.Targets object that has been loaded from a shapefile.
feature_sets (list of uncoverml.config.FeatureSetConfig) – A list of feature sets of ‘tabular’ type (sourced from shapefiles). Each set must have an attribute file that points to the shapefile to load and attribute fields which is the list of fields to retrieve as covariates from the file.
target_drop_values (list of any) – A list of values where if target observation is equal to value that row is dropped and also won’t be intersected with the covariates.
-
uncoverml.features.
remove_missing
(x, targets=None)¶
-
uncoverml.features.
save_intersected_features_and_targets
(feature_sets, transform_sets, targets, config, impute=True)¶ This function saves a table of covariate values and the target value intersected at each point. It also contains columns for UID ‘index’ and a predicted value.
If the target shapefile contains an ‘index’ field, this will be used to populate the ‘index’ column. This is intended to be used as a unique ID for each point in post-processing. If no ‘index’ field exists this column will be zero filled.
The ‘prediction’ column is for predicted values created during cross-validation. Again, this is for post-processing. It will only be populated if cross-validation is run later on. If not, it will be zero filled.
- Two files will be output:
…/output_dir/{name_of_config}_rawcovariates.csv …/output_dir/{name_of_config}_rawcovariates_mask.csv
This function will also optionally output intersected covariates scatter plot and covariate correlation matrix plot.
-
uncoverml.features.
transform_features
(feature_sets, transform_sets, final_transform, config)¶
uncoverml.filtering module¶
Code for computing the gamma sensor footprint, and for applying and unapplying spatial convolution filters to a given image.
BM: this is used in scripts/gammasensor_cli.py - I haven’t used it in my time with uncoverml or seen it used.
-
uncoverml.filtering.
fwd_filter
(img, S)¶
-
uncoverml.filtering.
inv_filter
(img, S, noise=0.001)¶
-
uncoverml.filtering.
kernel_impute
(img, S)¶
-
uncoverml.filtering.
pad2
(img)¶
-
uncoverml.filtering.
sensor_footprint
(img_w, img_h, res_x, res_y, height, mu_air)¶
uncoverml.geoio module¶
-
class
uncoverml.geoio.
ArrayImageSource
(A, origin, crs, pixsize)¶ Bases:
uncoverml.geoio.ImageSource
An image source that uses an internally stored numpy array
- Parameters
A (MaskedArray) – masked array of shape (xpix, ypix, channels) that contains the image data.
origin (ndarray) – Array of the form [lonmin, latmin] that defines the origin of the image
pixsize (ndarray) – Array of the form [pixsize_x, pixsize_y] defining the size of a pixel
-
data
(min_x, max_x, min_y, max_y)¶
-
class
uncoverml.geoio.
ImageSource
¶ Bases:
object
-
property
crs
¶
-
abstract
data
(min_x, max_x, min_y, max_y)¶
-
property
dtype
¶
-
property
full_resolution
¶
-
property
nodata_value
¶
-
property
origin_latitude
¶
-
property
origin_longitude
¶
-
property
pixsize_x
¶
-
property
pixsize_y
¶
-
property
-
class
uncoverml.geoio.
ImageWriter
(shape, bbox, crs, n_subchunks, outpath, outbands, band_tags=None, independent=False, **kwargs)¶ Bases:
object
-
close
()¶
-
nodata_value
= array(-1.e+20, dtype=float32)¶
-
output_thumbnails
(ratio=10)¶
-
write
(x, subchunk_index)¶ - Parameters
x –
subchunk_index –
independent – bool independent image writing by different processes, i.e., images are not chunked
- Returns
-
-
class
uncoverml.geoio.
RasterioImageSource
(filename)¶ Bases:
uncoverml.geoio.ImageSource
-
data
(min_x, max_x, min_y, max_y)¶
-
alias of
uncoverml.geoio.TrainingData
-
uncoverml.geoio.
crop_covariates
(config, outdir=None)¶ Crops the covariate files listed under config.feature_sets using the bounds provided under config.extents. The cropped covariates are stored in a temporary directory and the paths in config.feature_sets are redirected to these files. The caller is responsible for removing the files once they have been created.
- Parameters
config (uncoverml.config.Config) – Parsed UncoverML config.
outdir (str) – Absolute path to directory to store cropped covariates. If not provided, a tmp directory will be created.
-
uncoverml.geoio.
crop_mask
(config, outdir=None)¶ Crops the prediction mask listed under config.mask.
-
uncoverml.geoio.
crop_tif
(filename, extents, pixel_coordinates=False, outfile=None, strict=False)¶ Crops the geotiff using the provided extent.
- Parameters
filename (str) – Path to the geotiff to be cropped.
extents (tuple(float, float, float, float)) – Bounding box to crop by, ordering is (xmin, ymin, xmax, ymax). Data outside bounds will be cropped. Any elements that are None will be substituted with the original bound of the geotiff.
outfile (str) – Path to save cropped geotiff. If not provided, will be saved with original name + random id in tmp directory.
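The None-substitution rule for extents can be sketched as below. This is only an illustration of the documented behaviour: `resolve_extents`, `full_bounds` and `crop_box` are hypothetical names, not part of uncoverml, and the bounds are made-up values.

```python
from typing import Optional, Tuple

Bounds = Tuple[float, float, float, float]


def resolve_extents(
    extents: Tuple[Optional[float], Optional[float], Optional[float], Optional[float]],
    original: Bounds,
) -> Bounds:
    """Fill missing (None) entries of a crop box from the original bounds.

    Both tuples are ordered (xmin, ymin, xmax, ymax), matching the
    ordering documented for crop_tif.
    """
    return tuple(o if e is None else e for e, o in zip(extents, original))


# Hypothetical geotiff bounds; crop only the western and northern edges.
full_bounds = (110.0, -45.0, 155.0, -10.0)
crop_box = resolve_extents((120.0, None, None, -20.0), full_bounds)
```

Here the two None entries fall back to the original ymin and xmax, so only the specified edges move.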
-
uncoverml.geoio.
distribute_targets
(positions, observations, fields)¶ Distributes a target object across all nodes
-
uncoverml.geoio.
export_feature_ranks
(measures, feats, scores, config)¶
-
uncoverml.geoio.
export_model
(model, config)¶
-
uncoverml.geoio.
feature_names
(config)¶
-
uncoverml.geoio.
get_image_bounds
(config)¶
-
uncoverml.geoio.
get_image_crs
(config)¶
-
uncoverml.geoio.
get_image_pixel_res
(config)¶
-
uncoverml.geoio.
get_image_spec
(model, config)¶
-
uncoverml.geoio.
image_feature_sets
(targets, config)¶
-
uncoverml.geoio.
image_resolutions
(config)¶
-
uncoverml.geoio.
image_subchunks
(subchunk_index, config)¶
-
uncoverml.geoio.
load_shapefile
(filename, targetfield, covariate_crs, extents)¶ TODO
-
uncoverml.geoio.
load_targets
(shapefile, targetfield=None, covariate_crs=None, extents=None)¶ Loads the shapefile onto node 0 then distributes it across all available nodes.
Important: here the concatenated targets get sorted on the root processor by position (Y,X). It’s important that this order is preserved. Once covariates are intersected with the target data, they are also in this order. This ordering is what keeps the target and feature arrays synced.
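A minimal sketch of the ordering idea follows. It assumes that "(Y,X)" means Y is the primary sort key and X the secondary one; the library's actual sort may be implemented differently. `sort_by_position` is a hypothetical helper used only for illustration.

```python
def sort_by_position(positions, observations):
    """Sort targets by (Y, X) and return both sequences in the same order.

    positions: list of (x, y) tuples; observations: list of target values.
    """
    order = sorted(
        range(len(positions)),
        key=lambda i: (positions[i][1], positions[i][0]),
    )
    return [positions[i] for i in order], [observations[i] for i in order]


positions = [(3.0, 1.0), (1.0, 2.0), (1.0, 1.0)]
observations = [30.0, 12.0, 10.0]
sorted_pos, sorted_obs = sort_by_position(positions, observations)
```

Because the same permutation is applied to positions and observations, any feature array extracted in position order stays aligned with the targets.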
-
uncoverml.geoio.
resample
(input_tif, output_tif, ratio, resampling=5)¶ - Parameters
input_tif (str or rasterio.io.DatasetReader) – input file path or rasterio.io.DatasetReader object
output_tif (str) – output file path
ratio (float) – ratio by which to shrink/expand; ratio > 1 means shrink
resampling (int, optional) – default is 5 (average) resampling. Other options are as follows: nearest = 0, bilinear = 1, cubic = 2, cubic_spline = 3, lanczos = 4, average = 5, mode = 6, gauss = 7, max = 8, min = 9, med = 10, q1 = 11, q3 = 12
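The effect of ratio on the output size can be sketched as follows. The rounding rule (divide, truncate, keep at least one pixel) is an assumption made for this sketch and may differ from what the library does; `resampled_shape` and `RESAMPLING_CODES` are hypothetical names, with the codes copied from the list above.

```python
RESAMPLING_CODES = {
    "nearest": 0, "bilinear": 1, "cubic": 2, "cubic_spline": 3,
    "lanczos": 4, "average": 5, "mode": 6, "gauss": 7,
    "max": 8, "min": 9, "med": 10, "q1": 11, "q3": 12,
}


def resampled_shape(height, width, ratio):
    """Return the (height, width) after resampling by ratio.

    Assumption: each dimension is divided by ratio and truncated to an
    integer, with a floor of one pixel.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return max(1, int(height / ratio)), max(1, int(width / ratio))


shrunk = resampled_shape(1000, 2000, ratio=10)     # ratio > 1 shrinks
expanded = resampled_shape(1000, 2000, ratio=0.5)  # ratio < 1 expands
```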
-
uncoverml.geoio.
semisupervised_feature_sets
(targets, config)¶
-
uncoverml.geoio.
unsupervised_feature_sets
(config)¶
-
uncoverml.geoio.
write_shapefile_prediction
(pred, pred_tags, positions, config)¶
uncoverml.image module¶
Contains class and routines for reading chunked portions of images.
-
class
uncoverml.image.
Image
(source, chunk_idx=0, nchunks=1, overlap=0)¶ Bases:
object
Represents a raster Image. Can be used to get a georeferenced chunk of an Image and the data associated with it. This class is mainly used in the
features
module for intersecting image chunks with target data and extracting the image data. It’s also used in geoio
for getting covariate specs, such as CRS and bounds.
If nchunks > 1, then the Image is striped horizontally. chunk_idx 0 is the first strip of the image: the X range covers the full width of the image and the Y range runs from 0 to image_height / nchunks.
- Parameters
source (
ImageSource
) – An instance of ImageSource (typically RasterioImageSource). Defines the image to be loaded.
chunk_idx (int) – Which chunk of the image is being loaded.
nchunks (int) – Total number of chunks being used. This is typically set by the partitions parameter of the top level command, also set as n_subchunks on the Config object.
overlap (int) – Does not appear to be used at present; it seems intended for accommodating overlap between chunks (number of rows to overlap with bounding strips).
-
property
channels
¶
-
data
()¶
-
property
dtype
¶
-
in_bounds
(lonlat)¶
-
lonlat2pix
(lonlat)¶
-
property
nodata_value
¶
-
property
npoints
¶
-
patched_bbox
(patchsize)¶
-
patched_shape
(patchsize)¶
-
pix2lonlat
(xy)¶
-
property
x_range
¶
-
property
xmax
¶
-
property
xmin
¶
-
property
xres
¶
-
property
y_range
¶
-
property
ymax
¶
-
property
ymin
¶
-
property
yres
¶
-
uncoverml.image.
bbox2affine
(xmax, xmin, ymax, ymin, xres, yres)¶
-
uncoverml.image.
construct_splits
(npixels, nchunks, overlap=0)¶ Splits the image horizontally into approximately equal strips according to npixels / nchunks.
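One way to produce approximately equal strips is sketched below. It follows the convention of `numpy.array_split` (the first strips take any leftover rows) and ignores overlap; this is an assumption for illustration, not necessarily how construct_splits is implemented. `split_rows` is a hypothetical name.

```python
def split_rows(npixels, nchunks):
    """Return (start, stop) row ranges for nchunks near-equal strips.

    The first npixels % nchunks strips receive one extra row.
    Overlap between strips is not handled in this sketch.
    """
    base, extra = divmod(npixels, nchunks)
    ranges, start = [], 0
    for i in range(nchunks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


strips = split_rows(npixels=10, nchunks=3)
```

Each chunk index then selects one (start, stop) pair, so chunk 0 always begins at row 0.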
uncoverml.interpolate module¶
-
class
uncoverml.interpolate.
SKLearnCT
(fill_value=0, rescale=False, maxiter=1000, tol=0.0001)¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.RegressorMixin
- Scikit-learn wrapper for
scipy.interpolate.CloughTocher2DInterpolator class.
CloughTocher2DInterpolator(points, values, tol=1e-6)
Piecewise cubic, C1 smooth, curvature-minimizing interpolant in 2D.
New in version 0.9.
-
__call__
()¶
- Parameters
points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.
values (ndarray of float or complex, shape (npoints, ..)) – Data values.
fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is
nan
.
tol (float, optional) – Absolute/relative tolerance for gradient estimation.
maxiter (int, optional) – Maximum number of iterations in gradient estimation.
rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.
Notes
The interpolant is constructed by triangulating the input data with Qhull [1]_, and constructing a piecewise cubic interpolating Bezier polynomial on each triangle, using a Clough-Tocher scheme [CT]. The interpolant is guaranteed to be continuously differentiable.
The gradients of the interpolant are chosen so that the curvature of the interpolating surface is approximatively minimized. The gradients necessary for this are estimated using the global algorithm described in [Nielson83,Renka84]_.
References
- CT
See, for example, P. Alfeld, ‘’A trivariate Clough-Tocher scheme for tetrahedral data’’. Computer Aided Geometric Design, 1, 169 (1984); G. Farin, ‘’Triangular Bernstein-Bezier patches’’. Computer Aided Geometric Design, 3, 83 (1986).
- Nielson83
G. Nielson, ‘’A method for interpolating scattered data based upon a minimum norm network’’. Math. Comp., 40, 253 (1983).
- Renka84
R. J. Renka and A. K. Cline. ‘’A Triangle-based C1 interpolation method.’’, Rocky Mountain J. Math., 14, 223 (1984).
-
fit
(X, y)¶
-
predict
(X)¶
-
class
uncoverml.interpolate.
SKLearnLinearNDInterpolator
(fill_value=0, rescale=False)¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.RegressorMixin
- Scikit-learn wrapper for
scipy.interpolate.LinearNDInterpolator class.
LinearNDInterpolator(points, values, fill_value=np.nan, rescale=False)
Piecewise linear interpolant in N dimensions.
New in version 0.9.
-
__call__
()¶
- Parameters
points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.
values (ndarray of float or complex, shape (npoints, ..)) – Data values.
fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is
nan
.
rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.
Notes
The interpolant is constructed by triangulating the input data with Qhull [1]_, and on each triangle performing linear barycentric interpolation.
References
-
fit
(X, y)¶
-
predict
(X)¶
-
class
uncoverml.interpolate.
SKLearnNearestNDInterpolator
(rescale=False, tree_options=None)¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.RegressorMixin
- Scikit-learn wrapper for
scipy.interpolate.NearestNDInterpolator class.
NearestNDInterpolator(x, y)
Nearest-neighbour interpolation in N dimensions.
New in version 0.9.
-
__call__
()¶
- Parameters
x ((Npoints, Ndims) ndarray of floats) – Data point coordinates.
y ((Npoints,) ndarray of float or complex) – Data values.
rescale (boolean, optional) –
Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.
New in version 0.14.0.
tree_options (dict, optional) –
Options passed to the underlying
cKDTree
.
New in version 0.17.0.
Notes
Uses
scipy.spatial.cKDTree
-
fit
(X, y)¶
-
predict
(X)¶
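The wrappers in this module expose interpolators through a scikit-learn style fit/predict interface. The pattern can be sketched with a brute-force nearest-neighbour stand-in written in plain Python, so that it runs without scipy. `NearestWrapper` is a hypothetical class for illustration and is not the implementation of SKLearnNearestNDInterpolator.

```python
import math


class NearestWrapper:
    """Toy estimator with a scikit-learn style fit/predict interface."""

    def fit(self, X, y):
        # Interpolators are "fitted" by remembering the data points
        # and the values observed at them.
        self.points_ = [tuple(p) for p in X]
        self.values_ = list(y)
        return self

    def predict(self, X):
        # For every query point, return the value at the closest data point.
        out = []
        for q in X:
            dists = [math.dist(q, p) for p in self.points_]
            out.append(self.values_[dists.index(min(dists))])
        return out


model = NearestWrapper().fit([(0.0, 0.0), (1.0, 1.0)], [10.0, 20.0])
preds = model.predict([(0.1, 0.2), (0.9, 0.7)])
```

Returning `self` from `fit` allows the chained `fit(...).predict(...)` style used by scikit-learn estimators.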
-
class
uncoverml.interpolate.
SKLearnRbf
(function='multiquadric', smooth=0, norm='euclidean')¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.RegressorMixin
Scikit-learn wrapper for scipy.interpolate.Rbf class.
Rbf(*args)
A class for radial basis function approximation/interpolation of n-dimensional scattered data.
- Parameters
*args (arrays) – x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes
function (str or callable, optional) –
The radial basis function, based on the radius, r, given by the norm (default is Euclidean distance); the default is ‘multiquadric’:
'multiquadric': sqrt((r/self.epsilon)**2 + 1)
'inverse': 1.0/sqrt((r/self.epsilon)**2 + 1)
'gaussian': exp(-(r/self.epsilon)**2)
'linear': r
'cubic': r**3
'quintic': r**5
'thin_plate': r**2 * log(r)
If callable, then it must take 2 arguments (self, r). The epsilon parameter will be available as self.epsilon. Other keyword arguments passed in will be available as well.
epsilon (float, optional) – Adjustable constant for gaussian or multiquadrics functions - defaults to approximate average distance between nodes (which is a good start).
smooth (float, optional) – Values greater than zero increase the smoothness of the approximation. 0 is for interpolation (default), the function will always go through the nodal points in this case.
norm (str, callable, optional) – A function that returns the ‘distance’ between two points, with inputs as arrays of positions (x, y, z, …), and an output as an array of distance. E.g., the default: ‘euclidean’, such that the result is a matrix of the distances from each point in
x1
to each point in
x2
. For more options, see documentation of scipy.spatial.distance.cdist.
-
N
¶ The number of data points (as determined by the input arrays).
- Type
int
-
di
¶ The 1-D array of data values at each of the data coordinates xi.
- Type
ndarray
-
xi
¶ The 2-D array of data coordinates.
- Type
ndarray
-
function
¶ The radial basis function. See description under Parameters.
- Type
str or callable
-
epsilon
¶ Parameter used by gaussian or multiquadrics functions. See Parameters.
- Type
float
-
smooth
¶ Smoothing parameter. See description under Parameters.
- Type
float
-
norm
¶ The distance function. See description under Parameters.
- Type
str or callable
-
nodes
¶ A 1-D array of node values for the interpolation.
- Type
ndarray
-
A
¶ - Type
internal property, do not use
Examples
>>> import numpy as np
>>> from scipy.interpolate import Rbf
>>> x, y, z, d = np.random.rand(4, 50)
>>> rbfi = Rbf(x, y, z, d)  # radial basis function interpolator instance
>>> xi = yi = zi = np.linspace(0, 1, 20)
>>> di = rbfi(xi, yi, zi)   # interpolated values
>>> di.shape
(20,)
-
fit
(X, y)¶
-
predict
(X)¶
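The radial basis functions listed under the function parameter can be evaluated directly, as in the plain-Python sketch below. `radial_basis` is a hypothetical helper; the value 0 returned for 'thin_plate' at r = 0 is the limit of r**2 * log(r) and is a choice made for this sketch.

```python
import math


def radial_basis(name, r, epsilon=1.0):
    """Evaluate one of the named radial basis functions at radius r >= 0."""
    s = r / epsilon
    if name == "multiquadric":
        return math.sqrt(s ** 2 + 1)
    if name == "inverse":
        return 1.0 / math.sqrt(s ** 2 + 1)
    if name == "gaussian":
        return math.exp(-(s ** 2))
    if name == "linear":
        return r
    if name == "cubic":
        return r ** 3
    if name == "quintic":
        return r ** 5
    if name == "thin_plate":
        # r**2 * log(r) tends to 0 as r -> 0, so define the value there.
        return 0.0 if r == 0 else r ** 2 * math.log(r)
    raise ValueError(f"unknown radial basis function: {name}")
```

Only the multiquadric, inverse and gaussian forms depend on epsilon, which is why the parameter is documented as applying to those functions.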
uncoverml.krige module¶
-
class
uncoverml.krige.
Krige
(method='ordinary', variogram_model='linear', nlags=6, weight=False, n_closest_points=10, verbose=False)¶ Bases:
uncoverml.models.TagsMixin
,sklearn.base.RegressorMixin
,sklearn.base.BaseEstimator
,uncoverml.krige.KrigePredictDistMixin
A scikit-learn wrapper class for Ordinary and Universal Kriging. It works with both GridSearchCV and RandomizedSearchCV for optimising the Krige parameters.
-
fit
(x, y, *args, **kwargs)¶ - Parameters
x (ndarray) – array of Points, (x, y) pairs
y (ndarray) – array of targets
-
predict
(x, *args, **kwargs)¶ - Parameters
x (ndarray) –
- Returns
Prediction array
-
-
class
uncoverml.krige.
KrigePredictDistMixin
¶ Bases:
object
Mixin class for providing a
predict_dist
method to the Krige class.
This is especially for use with PyKrige Ordinary/UniversalKriging classes.
-
predict_dist
(x, interval=0.95, *args, **kwargs)¶ Predictive mean and variance for a probabilistic regressor.
- Parameters
x (ndarray) – (Ns, 2) array query dataset (Ns samples, 2 dimensions).
interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.
- Returns
prediction (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).
variance (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).
ql (ndarray) – The lower end point of the interval with shape (Ns,)
qu (ndarray) – The upper end point of the interval with shape (Ns,)
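The relationship between the returned mean, variance and interval end points can be sketched as follows, assuming the predictive distribution is Gaussian. That assumption, and the helper name `gaussian_interval`, belong to this sketch rather than to the documented API.

```python
from statistics import NormalDist


def gaussian_interval(mean, variance, interval=0.95):
    """Return (ql, qu), the central interval of N(mean, variance)."""
    if variance == 0:
        # Degenerate case: no uncertainty, so the interval collapses.
        return mean, mean
    tail = (1.0 - interval) / 2.0
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


ql, qu = gaussian_interval(mean=2.0, variance=4.0, interval=0.95)
```

With interval=0.95 the end points sit roughly 1.96 standard deviations either side of the mean.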
-
-
class
uncoverml.krige.
MLKrige
(ml_method, ml_params={}, *args, **kwargs)¶ Bases:
object
-
class
uncoverml.krige.
MLKrigeBase
(ml_method, ml_params={}, method='ordinary', variogram_model='linear', n_closest_points=10, nlags=6, weight=False, verbose=False)¶ Bases:
uncoverml.models.TagsMixin
This is an implementation of Regression-Kriging as described here: https://en.wikipedia.org/wiki/Regression-Kriging
-
fit
(x, y, lon_lat, *args, **kwargs)¶ Fit the ML method and also Krige the residual.
- Parameters
x (ndarray) – (Nt, d) array training dataset (Nt samples, d dimensions) for ML regression
y (ndarray) – array of targets (Nt, )
lon_lat – ndarray of (x, y) points. Needs to be a (Nt, 2) array corresponding to the lon/lat, for example.
-
krige_residual
(lon_lat)¶ - Parameters
lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.
- Returns
- residual: ndarray
kriged residual values
-
ml_prediction
(x, *args, **kwargs)¶ - Parameters
x (ndarray) – regression matrix
- Returns
ndarray
machine learning prediction
-
predict
(x, lon_lat, *args, **kwargs)¶ Must override predict_dist method of Krige. Predictive mean and variance for a probabilistic regressor.
- Parameters
x (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression
lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.
- Returns
pred – The expected value of ys for the query inputs, X of shape (Ns,).
- Return type
ndarray
-
score
(x, y, lon_lat, sample_weight=None)¶ Overloading default regression score method
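The structure of regression-kriging, a trend model plus a spatially interpolated residual, can be sketched with deliberately simple stand-ins. In the toy below the "ML" step is a constant mean model and inverse-distance weighting takes the place of kriging; neither is what MLKrigeBase uses, and `ResidualInterpolationSketch` is a hypothetical class.

```python
import math


class ResidualInterpolationSketch:
    """Trend model plus spatially interpolated residuals (toy version)."""

    def fit(self, x, y, lon_lat):
        # "ML" step: a constant model predicting the mean of the targets
        # (the covariates x are ignored by this trivial model).
        self.mean_ = sum(y) / len(y)
        # Residuals of the trend model at the training locations.
        self.residuals_ = [yi - self.mean_ for yi in y]
        self.locations_ = [tuple(p) for p in lon_lat]
        return self

    def _interpolate_residual(self, point):
        # Inverse-distance weighting, standing in for kriging.
        weights = []
        for loc, res in zip(self.locations_, self.residuals_):
            d = math.dist(point, loc)
            if d == 0:
                return res  # exact hit on a training location
            weights.append((1.0 / d ** 2, res))
        total = sum(w for w, _ in weights)
        return sum(w * r for w, r in weights) / total

    def predict(self, x, lon_lat):
        # Final prediction = trend prediction + interpolated residual.
        return [self.mean_ + self._interpolate_residual(p) for p in lon_lat]


model = ResidualInterpolationSketch().fit(
    x=[[0.0], [0.0]], y=[1.0, 3.0], lon_lat=[(0.0, 0.0), (2.0, 0.0)]
)
preds = model.predict(x=[[0.0], [0.0]], lon_lat=[(0.0, 0.0), (1.0, 0.0)])
```

At a training location the residual is recovered exactly, so the prediction reproduces the observed target; midway between the two points the residuals cancel and only the trend remains.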
-
-
class
uncoverml.krige.
MLKrigePredictDistMixin
¶ Bases:
object
-
predict_dist
(x, interval=0.95, lon_lat=None, *args, **kwargs)¶ Predictive mean, variance, lower and upper quantile for a probabilistic regressor.
- Parameters
x (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression
lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.
interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.
kwargs – must contain a key lon_lat, which needs to be a (Ns, 2) array corresponding to the lon/lat
- Returns
pred (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).
var (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).
ql (ndarray) – The lower end point of the interval with shape (Ns,)
qu (ndarray) – The upper end point of the interval with shape (Ns,)
-
-
class
uncoverml.krige.
MLKrigePreidctDist
(*args, **kwargs)¶ Bases:
uncoverml.krige.MLKrigeBase
,uncoverml.krige.MLKrigePredictDistMixin
uncoverml.learn module¶
Handles calling learning methods on models.
-
uncoverml.learn.
local_learn_model
(x_all, targets_all, config)¶ Trains a model. Handles special case of parallel models.
- Parameters
x_all (np.ndarray) – All covariate data, shape (n_samples, n_features), sorted using X, Y of target positions.
targets_all (np.ndarray) – All target data, shape (n_samples), sorted using X, Y of target positions.
config (
Config
) – Config object.
- Returns
A trained Model.
- Return type
Model
uncoverml.likelihoods module¶
Likelihood functions that can be used with revrand.
Can be used with revrand’s GeneralisedLinearModel class for specialised regression tasks such as basement depth estimation from censored and uncensored depth observations.
-
class
uncoverml.likelihoods.
Switching
(lenscale=1.0, var_init=Parameter(value=1.0, bounds=Positive(upper=None), shape=()))¶ Bases:
revrand.likelihoods.Bernoulli
-
Ey
(f, var, z)¶ Expected value of the Bernoulli likelihood.
- Parameters
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).
- Return type
ndarray
-
cdf
(y, f, var, z)¶ Cumulative density function of the likelihood.
- Parameters
y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
cdf – Cumulative density function evaluated at y.
- Return type
ndarray
-
df
(y, f, var, z)¶ Derivative of Bernoulli log likelihood w.r.t. f.
- Parameters
y (ndarray) – array of 0, 1 valued integers of targets
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
df – the derivative \(\partial \log p(y|f) / \partial f\)
- Return type
ndarray
-
dp
(y, f, var, z)¶ Derivative of Bernoulli log likelihood w.r.t. the parameters, \(\theta\).
- Parameters
y (ndarray) – array of 0, 1 valued integers of targets
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
dp – the derivative \(\partial \log p(y|f, \theta)/ \partial \theta\) for each parameter. If there is only one parameter, this is not a list.
- Return type
list, float or ndarray
-
loglike
(y, f, var, z)¶ Bernoulli log likelihood.
- Parameters
y (ndarray) – array of 0, 1 valued integers of targets
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
logp – the log likelihood of each y given each f under this likelihood.
- Return type
ndarray
-
-
class
uncoverml.likelihoods.
UnifGauss
(lenscale=1.0)¶ Bases:
revrand.likelihoods.Bernoulli
-
Ey
(f)¶ Expected value of the Bernoulli likelihood.
- Parameters
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).
- Return type
ndarray
-
cdf
(y, f)¶ Cumulative density function of the likelihood.
- Parameters
y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
cdf – Cumulative density function evaluated at y.
- Return type
ndarray
-
df
(y, f)¶ Derivative of Bernoulli log likelihood w.r.t. f.
- Parameters
y (ndarray) – array of 0, 1 valued integers of targets
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
df – the derivative \(\partial \log p(y|f) / \partial f\)
- Return type
ndarray
-
loglike
(y, f)¶ Bernoulli log likelihood.
- Parameters
y (ndarray) – array of 0, 1 valued integers of targets
f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))
- Returns
logp – the log likelihood of each y given each f under this likelihood.
- Return type
ndarray
-
pdf
(y, f)¶
-
uncoverml.metadata_profiler module¶
- Description:
Gather Metadata for the uncover-ml prediction output results:
Reference: email 2019-05-24 Overview Creator: (person who generated the model) Model;
Name: Type and date: Algorithm: Extent: Lat/long - location on Australia map?
SB Notes: None of the above is required as this information will be captured in the yaml file.
Model inputs:
Covariates - list (in full)
2. Targets: path to shapefile: csv file SB Notes: Only the covariate list file. Targets and the path to the shapefile are not required, as these are available in the yaml file. The full path to the shapefile may have some merit, as one can specify a partial path.
- Model performance
JSON file (in full)
SB Notes: Yes
Model outputs
Prediction grid including path
Quantiles Q5; Q95
Variance:
Entropy:
Feature rank file
Raw covariates file (target value - covariate value)
Optimisation output
8. Others ?? SB Notes: Not required as these are model dependent, and the metadata will be contained in each of the output geotiff files.
Model parameters: 1. YAML file (in full) 2. .SH file (in full) SB Notes: The .sh file is not required. YAML file is read as a python dictionary in uncoverml which can be dumped in the metadata.
CreationDate: 31/05/19 Developer: fei.zhang@ga.gov.au
- Revision History:
LastUpdate: 31/05/19 FZ LastUpdate: dd/mm/yyyy Who Optional description
-
class
uncoverml.metadata_profiler.
MetadataSummary
(model, config)¶ Bases:
object
Summary Description of the ML prediction output
-
write_metadata
(out_filename)¶ Write the metadata for this prediction result into a human-readable txt file, in order to make the ML results traceable and reproducible (provenance).
-
uncoverml.mllog module¶
Logging config.
-
class
uncoverml.mllog.
MPIStreamHandler
(stream=None)¶ Bases:
logging.StreamHandler
If a message starts with ‘:mpi:’, the message will be logged regardless of node (the ‘:mpi:’ will be removed from the message). Otherwise, only node 0 will emit messages.
-
emit
(record)¶ Emit a record.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.
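The ':mpi:' rule described for this handler can be sketched with the standard logging module. In this illustration the rank is passed in explicitly (in a real MPI run it would come from the communicator), and `RankAwareHandler` is a hypothetical class, not the MPIStreamHandler source.

```python
import io
import logging


class RankAwareHandler(logging.StreamHandler):
    """Emit ':mpi:' messages on every rank, other messages on rank 0 only."""

    PREFIX = ":mpi:"

    def __init__(self, rank, stream=None):
        super().__init__(stream)
        self.rank = rank  # hypothetical: supplied by the caller in this sketch

    def emit(self, record):
        message = record.getMessage()
        if message.startswith(self.PREFIX):
            # Strip the marker and log regardless of rank.
            record.msg = message[len(self.PREFIX):]
            record.args = ()
            super().emit(record)
        elif self.rank == 0:
            super().emit(record)


buffer = io.StringIO()
logger = logging.getLogger("rank_aware_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(RankAwareHandler(rank=1, stream=buffer))
logger.info("only node 0 prints this")
logger.info(":mpi:every node prints this")
output = buffer.getvalue()
```

On rank 1 only the marked message reaches the stream, with the marker removed.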
-
-
uncoverml.mllog.
configure
(verbosity)¶
-
uncoverml.mllog.
handle_exception
(exc_type, exc_value, exc_traceback)¶ Add MPI index to exception traceback.
-
uncoverml.mllog.
warn_with_traceback
(message, category, filename, lineno, line=None)¶ copied from: http://stackoverflow.com/questions/22373927/get-traceback-of-warnings
uncoverml.models module¶
Model Objects and ML algorithm serialisation.
This module makes many of the models in scikit learn and revrand available to our pipeline, as well as augmenting their functionality with, for example, target transformations.
This table is a quick breakdown of the advantages and disadvantages of the various algorithms we can use in this pipeline.
Algorithm | Learning Scalability | Modelling Capacity | Prediction Speed | Probabilistic |
---|---|---|---|---|
Bayesian linear regression | + + + | + | + + + + | Yes |
Approx. Gaussian process | + + | + + + + | + + + + | Yes |
SGD linear regression | + + + + | + | + + + | Yes |
SGD Gaussian process | + + + + | + + + + | + + + | Yes |
Support Vector Regression | + | + + + + | + | No |
Random Forest Regression | + + + | + + + + | + + | Pseudo |
Cubist Regression | + + + | + + + + | + + | Pseudo |
ARD Regression | + + | + + | + + + | No |
Extremely Randomized Reg. | + + + | + + + + | + + | No |
Decision Tree Regression | + + + | + + + | + + + + | No |
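The module description mentions target transformations. The general idea, fitting in a transformed target space and inverting the transform at prediction time, can be sketched as below. The log transform, `MeanRegressor` and `LogTargetRegressor` are hypothetical choices for illustration; they are not the transforms or wrapper classes defined in this module.

```python
import math


class MeanRegressor:
    """Trivial regressor: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LogTargetRegressor:
    """Fit in log space, then map predictions back to the original scale."""

    def __init__(self, regressor):
        self.regressor = regressor

    def fit(self, X, y):
        # Forward transform of the targets before fitting.
        self.regressor.fit(X, [math.log(v) for v in y])
        return self

    def predict(self, X):
        # Inverse transform of the predictions.
        return [math.exp(p) for p in self.regressor.predict(X)]


model = LogTargetRegressor(MeanRegressor()).fit([[0], [1]], [1.0, 100.0])
prediction = model.predict([[2]])
```

Averaging in log space gives the geometric mean of the targets (10 here) rather than the arithmetic mean, which shows that the transform changes what the wrapped model learns.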
-
class
uncoverml.models.
ARDRegressionTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
ARD regression.
-
class
uncoverml.models.
ApproxGP
(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, tol=1e-08, maxiter=1000, nstarts=100)¶ Bases:
uncoverml.models.BasisMakerMixin
,revrand.slm.StandardLinearModel
,uncoverml.models.PredictDistMixin
,uncoverml.models.MutualInfoMixin
An approximate Gaussian process for medium scale data.
- Parameters
kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at
basismap
dictionary for appropriate kernel approximations.
nbases (int) – how many unique random bases to create (twice this number will be actually created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this will have to also be high dimensional.
lenscale (float, optional) – the initial value for the kernel length scale to be learned.
ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.
var (Parameter, optional) – observation variance initial value.
regulariser (Parameter, optional) – weight regulariser (variance) initial value.
tol (float, optional) – optimiser function tolerance convergence criterion.
maxiter (int, optional) – maximum number of iterations for the optimiser.
nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
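Why nbases random bases yield twice as many features can be illustrated with random Fourier features for an RBF kernel. This is a generic sketch of the technique under the usual Gaussian spectral density; how revrand actually builds its bases may differ, and both function names are hypothetical.

```python
import math
import random


def sample_frequencies(nbases, ndims, lenscale, seed=0):
    """Draw frequencies for an RBF kernel: w ~ N(0, 1 / lenscale**2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / lenscale) for _ in range(ndims)]
            for _ in range(nbases)]


def random_fourier_features(x, frequencies):
    """Map a point x to 2 * len(frequencies) features.

    Each frequency vector contributes a cosine and a sine component,
    which is why twice nbases features are produced.
    """
    scale = 1.0 / math.sqrt(len(frequencies))
    projections = [sum(wi * xi for wi, xi in zip(w, x)) for w in frequencies]
    return ([scale * math.cos(p) for p in projections]
            + [scale * math.sin(p) for p in projections])


freqs = sample_frequencies(nbases=50, ndims=3, lenscale=1.0)
phi = random_fourier_features([0.2, -1.0, 0.5], freqs)
```

The inner product of two such feature vectors approximates the kernel value between the two points, and the approximation tightens as nbases grows.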
-
class
uncoverml.models.
ApproxGPTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Approximate Gaussian process.
-
class
uncoverml.models.
BasisMakerMixin
¶ Bases:
object
Mixin class for easily creating approximate kernel functions for revrand.
This is primarily used for the approximate Gaussian process algorithms.
-
fit
(X, y, *args, **kwargs)¶
-
-
class
uncoverml.models.
BootstrappedSVR
(n_models=100, parallel=True, *args, **kwargs)¶ Bases:
uncoverml.models.bootstrap_model.<locals>.BootstrappedModel
,uncoverml.models.TagsMixin
-
class
uncoverml.models.
CubistMultiTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Parallel Cubist regression (wrapper).
-
class
uncoverml.models.
CubistTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Cubist regression (wrapper).
-
class
uncoverml.models.
CustomKNeighborsRegressor
(n_neighbors=10, weights='distance', algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, min_distance=0.0)¶ Bases:
sklearn.neighbors._regression.KNeighborsRegressor
-
class
uncoverml.models.
DecisionTreeTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Decision tree regression.
-
class
uncoverml.models.
ExtraTreeTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Extremely randomised tree regressor.
-
class
uncoverml.models.
GLMPredictDistMixin
¶ Bases:
object
Mixin class for providing a
predict_dist
method to the GeneralisedLinearModel class in revrand.
This is especially for use with Gaussian likelihood models.
-
predict_dist
(X, interval=0.95, *args, **kwargs)¶ Predictive mean and variance for a probabilistic regressor.
- Parameters
X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).
interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.
fields (dict, optional) – dictionary of fields parsed from the shape file.
indicator_field
should be a key in this dictionary. If this is not present, then a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.
- Returns
Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).
Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).
ql (ndarray) – The lower end point of the interval with shape (Ns,)
qu (ndarray) – The upper end point of the interval with shape (Ns,)
-
-
class
uncoverml.models.
GradBoostedTrees
(*args, **kwargs)¶ Bases:
uncoverml.models.encode_targets.<locals>.EncodedClassifier
,uncoverml.models.TagsMixin
Gradient Boosted Trees multi-class classification.
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
-
class
uncoverml.models.
KNearestNeighborTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
K Nearest Neighbour Regression
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
-
class
uncoverml.models.
LinearReg
(onescol=True, var=1.0, regulariser=1.0, tol=1e-08, maxiter=1000, nstarts=100)¶ Bases:
revrand.slm.StandardLinearModel
,uncoverml.models.PredictDistMixin
,uncoverml.models.MutualInfoMixin
Bayesian standard linear model.
- Parameters
onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)
var (Parameter, optional) – observation variance initial value.
regulariser (Parameter, optional) – weight regulariser (variance) initial value.
tol (float, optional) – optimiser function tolerance convergence criterion.
maxiter (int, optional) – maximum number of iterations for the optimiser.
nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
-
class
uncoverml.models.
LinearRegTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Bayesian linear regression.
-
class
uncoverml.models.
LogisticClassifier
(*args, **kwargs)¶ Bases:
uncoverml.models.encode_targets.<locals>.EncodedClassifier
,uncoverml.models.TagsMixin
Logistic Regression for multi-class classification.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
-
class
uncoverml.models.
LogisticRBF
(*args, **kwargs)¶ Bases:
uncoverml.models.encode_targets.<locals>.EncodedClassifier
,uncoverml.models.TagsMixin
Approximate large scale kernel logistic regression.
-
class
uncoverml.models.
MaskRows
(*Xs)¶ Bases:
object
-
apply_mask
(X)¶
-
apply_masks
(*Xs)¶
-
static
get_complete_rows
(X)¶
-
trim_mask
(X)¶
-
trim_masks
(*Xs)¶
-
-
class
uncoverml.models.
MultiRandomForestTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
MPI implementation of Random forest regression with forest grown on many CPUs.
-
class
uncoverml.models.
MutualInfoMixin
¶ Bases:
object
Mixin class for providing predictive entropy reduction functionality to the StandardLinearModel class (only).
-
entropy_reduction
(X)¶ Predictive entropy reduction (a.k.a. mutual information).
Estimate the reduction in the posterior distribution’s entropy (i.e. model uncertainty reduction) as a result of including a particular observation.
- Parameters
X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).
- Returns
MI – Prediction of mutual information (expected reduction in posterior entropy) associated with each query input. The units are ‘nats’, and the shape of the returned array is (Ns,).
- Return type
ndarray
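For a Bayesian linear model with Gaussian noise, the entropy reduction from one observation has the standard closed form 0.5 * log(1 + phi^T C phi / noise_var), where C is the posterior weight covariance. The sketch below evaluates that formula; treating it as what entropy_reduction computes is an assumption, and the function and argument names are hypothetical.

```python
import math


def entropy_reduction(phi, weight_cov, noise_var):
    """Mutual information (nats) between one observation and the weights.

    phi: feature vector of the query point.
    weight_cov: posterior covariance of the weights, as a list of rows.
    noise_var: observation noise variance.
    """
    # C @ phi, computed row by row.
    c_phi = [sum(c * p for c, p in zip(row, phi)) for row in weight_cov]
    # Predictive variance of the latent function: phi^T C phi.
    predictive_var = sum(p * v for p, v in zip(phi, c_phi))
    return 0.5 * math.log(1.0 + predictive_var / noise_var)


mi = entropy_reduction([1.0, 0.0], [[1.0, 0.0], [0.0, 4.0]], noise_var=1.0)
```

The natural logarithm gives the result in nats, matching the units stated for the return value.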
-
-
class
uncoverml.models.
PredictDistMixin
¶ Bases:
object
Mixin class for providing a
predict_dist
method to the StandardLinearModel class in revrand.
-
predict_dist
(X, interval=0.95, *args, **kwargs)¶ Predictive mean and variance for a probabilistic regressor.
- Parameters
X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).
interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.
fields (dict, optional) – dictionary of fields parsed from the shape file.
indicator_field
should be a key in this dictionary. If this is not present, then a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.
- Returns
Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).
Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).
ql (ndarray) – The lower end point of the interval with shape (Ns,)
qu (ndarray) – The upper end point of the interval with shape (Ns,)
-
-
class
uncoverml.models.
RandomForestClassifier
(*args, **kwargs)¶ Bases:
uncoverml.models.encode_targets.<locals>.EncodedClassifier
,uncoverml.models.TagsMixin
Random Forest for multi-class classification.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
-
class
uncoverml.models.
RandomForestRegressor
(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)¶ Bases:
sklearn.ensemble._forest.RandomForestRegressor
Implements a “probabilistic” output by looking at the variance of the decision tree estimator outputs.
-
predict_dist
(X, interval=0.95)¶
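The idea of a “probabilistic” output from tree variance can be sketched as follows. The example takes a table of per-tree predictions (a hypothetical input; with scikit-learn these would come from the fitted trees of the forest) and treats their mean and variance as a Gaussian predictive distribution. That Gaussian treatment is an assumption of this sketch, not a statement about the class's actual code.

```python
import math
from statistics import NormalDist, fmean, pvariance


def forest_predict_dist_sketch(tree_predictions, interval=0.95):
    """tree_predictions[t][i] is the prediction of tree t for sample i."""
    z = NormalDist().inv_cdf(0.5 + interval / 2.0)
    # regroup so that each entry holds all tree outputs for one sample
    per_sample = [list(p) for p in zip(*tree_predictions)]
    Ey = [fmean(p) for p in per_sample]
    Vy = [pvariance(p) for p in per_sample]
    ql = [m - z * math.sqrt(v) for m, v in zip(Ey, Vy)]
    qu = [m + z * math.sqrt(v) for m, v in zip(Ey, Vy)]
    return Ey, Vy, ql, qu


Ey, Vy, ql, qu = forest_predict_dist_sketch([[1.0, 4.0], [3.0, 4.0]])
```

Where all trees agree (the second sample here), the variance is zero and the interval collapses to the mean.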
-
-
class
uncoverml.models.
RandomForestRegressorMulti
(outdir='.', forests=10, parallel=True, n_estimators=10, random_state=1, **kwargs)¶ Bases:
object
-
fit
(x, y, *args, **kwargs)¶
-
predict
(x)¶
-
predict_dist
(x, interval=0.95, *args, **kwargs)¶
-
-
class
uncoverml.models.
RandomForestTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Random forest regression.
-
class
uncoverml.models.
SGDApproxGP
(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=1, nstarts=500)¶ Bases:
uncoverml.models.BasisMakerMixin
,revrand.glm.GeneralisedLinearModel
,uncoverml.models.GLMPredictDistMixin
An approximate Gaussian process for large scale data using stochastic gradients.
This uses the Adam stochastic gradients algorithm; http://arxiv.org/pdf/1412.6980
- Parameters
kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at
basismap
dictionary for appropriate kernel approximations.
nbases (int) – how many unique random bases to create (twice this number will actually be created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this number will also have to be high.
lenscale (float, optional) – the initial value for the kernel length scale to be learned.
ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.
var (float, optional) – observation variance initial value.
regulariser (float, optional) – weight regulariser (variance) initial value.
maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.
batch_size (int, optional) – number of observations to use per SGD batch.
alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.
beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].
beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].
epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).
random_state (int or RandomState, optional) – random seed
nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
Note
Setting the
random_state
may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!
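The roles of alpha, beta1, beta2, epsilon and maxiter listed above can be seen in a bare-bones version of the Adam update from the cited paper. This is a sketch only: the function name is hypothetical, a deterministic gradient stands in for a minibatch gradient, and revrand's own optimiser is not reproduced here.

```python
import math


def adam_minimise(grad, w0, alpha=0.01, beta1=0.9, beta2=0.99,
                  epsilon=1e-8, maxiter=3000):
    """Minimise a function given its gradient, using plain Adam updates."""
    w = list(w0)
    m = [0.0] * len(w)  # running mean of the gradient
    v = [0.0] * len(w)  # running mean of the squared gradient
    for t in range(1, maxiter + 1):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # bias correction for the zero-initialised averages
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            w[i] -= alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# toy objective (w - 3)^2, whose gradient is 2 * (w - 3)
w = adam_minimise(lambda w: [2.0 * (w[0] - 3.0)], [0.0])
```

Because each step is roughly alpha in size regardless of the gradient scale, alpha and maxiter together bound how far the parameters can travel from their initial values.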
-
class
uncoverml.models.
SGDApproxGPTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Approximate Gaussian processes with stochastic gradients.
-
class
uncoverml.models.
SGDLinearReg
(onescol=True, var=1.0, regulariser=1.0, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=None, nstarts=500)¶ Bases:
revrand.glm.GeneralisedLinearModel
,uncoverml.models.GLMPredictDistMixin
Bayesian standard linear model, using stochastic gradients.
This uses the Adam stochastic gradients algorithm; http://arxiv.org/pdf/1412.6980
- Parameters
onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)
var (Parameter, optional) – observation variance initial value.
regulariser (Parameter, optional) – weight regulariser (variance) initial value.
maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.
batch_size (int, optional) – number of observations to use per SGD batch.
alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.
beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].
beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].
epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).
random_state (int or RandomState, optional) – random seed
nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
Note
Setting the
random_state
may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!
-
class
uncoverml.models.
SGDLinearRegTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Bayesian linear regression with stochastic gradients.
-
class
uncoverml.models.
SVRTransformed
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
Support vector machine.
-
class
uncoverml.models.
SupportVectorClassifier
(*args, **kwargs)¶ Bases:
uncoverml.models.encode_targets.<locals>.EncodedClassifier
,uncoverml.models.TagsMixin
Support Vector Machine multi-class classification.
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
-
class
uncoverml.models.
TagsMixin
¶ Bases:
object
Mixin class to aid a pipeline in establishing the types of predictive outputs to be expected from the ML algorithms in this module.
Get the types of prediction outputs from this algorithm.
- Returns
List of strings with the types of outputs that can be returned by this algorithm. This depends on the prediction methods implemented (e.g.
predict
, predict_dist, entropy_reduction
).
- Return type
list
-
class
uncoverml.models.
TransformedCTInterpolator
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
-
class
uncoverml.models.
TransformedLinearNDInterpolator
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
-
class
uncoverml.models.
TransformedNearestNDInterpolator
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
-
class
uncoverml.models.
TransformedRbfInterpolator
(target_transform='identity', *args, **kwargs)¶ Bases:
uncoverml.models.transform_targets.<locals>.TransformedRegressor
,uncoverml.models.TagsMixin
-
uncoverml.models.
apply_masked
(func, data, *args, **kwargs)¶
-
uncoverml.models.
apply_multiple_masked
(func, data, *args, **kwargs)¶
-
uncoverml.models.
bootstrap_model
(model)¶
-
uncoverml.models.
encode_targets
(Classifier)¶
-
uncoverml.models.
kernelize
(classifier)¶
-
uncoverml.models.
transform_targets
(Regressor)¶ Factory function that adds target transformation capability to compatible scikit-learn objects.
Look at the
transformers.py
module for more information on valid target transformers.
Example
>>> svr = transform_targets(SVR)(target_transform='Standardise', gamma=0.1)
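The factory pattern behind this kind of call can be sketched without scikit-learn. Everything below is illustrative: LogTransform, MeanRegressor and transform_targets_sketch are hypothetical stand-ins, and the sketch takes a transformer object where the real function takes a transformer name such as 'Standardise'.

```python
import math


class LogTransform:
    """Toy invertible target transform (hypothetical)."""

    def transform(self, y):
        return [math.log(v) for v in y]

    def inverse(self, y):
        return [math.exp(v) for v in y]


class MeanRegressor:
    """Toy regressor that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def transform_targets_sketch(Regressor):
    """Return a subclass that fits on transformed targets and
    maps predictions back to the original target space."""

    class TransformedRegressor(Regressor):
        def __init__(self, target_transform, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.target_transform = target_transform

        def fit(self, X, y):
            return super().fit(X, self.target_transform.transform(y))

        def predict(self, X):
            return self.target_transform.inverse(super().predict(X))

    return TransformedRegressor


model = transform_targets_sketch(MeanRegressor)(LogTransform())
model.fit([[0], [1]], [1.0, 100.0])
pred = model.predict([[2]])
```

The toy model averages in log space, so it predicts the geometric mean of the targets (10) rather than the arithmetic mean (50.5).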
uncoverml.mpiops module¶
-
uncoverml.mpiops.
chunk_index
= 0¶ the index (from zero) of this node in the MPI world. Also known as the rank of the node.
- Type
int
-
uncoverml.mpiops.
chunks
= 1¶ the total number of nodes in the MPI world
- Type
int
-
uncoverml.mpiops.
comm
= <mpi4py.MPI.Intracomm object>¶ module-level MPI ‘world’ object representing all connected nodes
-
uncoverml.mpiops.
count
(x)¶
-
uncoverml.mpiops.
count_targets
(targets)¶
-
uncoverml.mpiops.
covariance
(x)¶
Create a shared numpy array among MPI nodes. To access the data, refer to the returned numpy array ‘shared’. The second return value is the MPI window. This doesn’t need to be interacted with except when deallocating the memory.
When finished with the data, set shared = None and call win.Free().
Caution: any node with a handle on the shared array can modify its contents. To be safe, the shared array is set to read-only by default.
- Parameters
data (numpy.ndarray) – The numpy array to share.
root (int) – Rank of the root node that contains the original data.
writeable (bool) – Whether or not the resulting shared array is writeable.
- Returns
- Return type
tuple of numpy.ndarray, MPI window
-
uncoverml.mpiops.
eigen_decomposition
(x)¶
-
uncoverml.mpiops.
max_axis_0
(x, y, dtype)¶
-
uncoverml.mpiops.
mean
(x)¶
-
uncoverml.mpiops.
min_axis_0
(x, y, dtype)¶
-
uncoverml.mpiops.
minimum
(x)¶
-
uncoverml.mpiops.
outer
(x)¶
-
uncoverml.mpiops.
outer_count
(x)¶
-
uncoverml.mpiops.
power
(x, exp)¶
-
uncoverml.mpiops.
random_full_points
(x, Napprox)¶
-
uncoverml.mpiops.
run_once
(f, *args, **kwargs)¶ Run a function on one node, broadcast result to all.
This function evaluates a function on a single node in the MPI world, then broadcasts the result of that function to every node in the world.
- Parameters
f – The function to be evaluated. Can take arbitrary arguments and return anything or nothing.
args (optional) – Other positional arguments to pass on to f
kwargs (optional) – Other named arguments to pass on to f
- Returns
The value returned by f
- Return type
result
-
uncoverml.mpiops.
sd
(x)¶
-
uncoverml.mpiops.
sum_axis_0
(x, y, dtype)¶
-
uncoverml.mpiops.
unique
(sets1, sets2, dtype)¶
uncoverml.patch module¶
Image patch extraction and windowing utilities.
-
uncoverml.patch.
all_patches
(image, patchsize)¶
-
uncoverml.patch.
grid_patches
(image, pwidth)¶ Generate (overlapping) patches from an image. This function extracts square patches from an image in an overlapping, dense grid.
- Parameters
image (ndarray) – an array of shape (x, y) or (x, y, channels).
pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.
- Returns
patch – An image of shape (x, y, channels*psize*psize), where psize = pwidth * 2 + 1
- Return type
ndarray
-
uncoverml.patch.
patches_at_target
(image, patchsize, targets)¶
-
uncoverml.patch.
point_patches
(image, pwidth, points)¶ Extract patches from an image at specified points.
- Parameters
image (ndarray) – an array of shape (x, y, channels).
pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.
points (ndarray) – of shape (N, 2) where there are N points, each with an x and y coordinate of the patch centre within the image.
- Returns
patches – An image patch array of shape (N, psize, psize, channels), where psize = pwidth * 2 + 1
- Return type
ndarray
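The half-width convention (psize = pwidth * 2 + 1) is easiest to see on a small example. This sketch works on a single-channel image stored as nested lists indexed as image[x][y], and assumes every point lies at least pwidth pixels from the border; how the library itself treats points near the edge is not stated in this reference.

```python
def point_patches_sketch(image, pwidth, points):
    """Extract square patches of width pwidth * 2 + 1 centred on points."""
    patches = []
    for x, y in points:
        patch = [
            row[y - pwidth:y + pwidth + 1]
            for row in image[x - pwidth:x + pwidth + 1]
        ]
        patches.append(patch)
    return patches


# 5 x 5 image whose pixel value encodes its position: value = 5 * x + y
image = [[5 * r + c for c in range(5)] for r in range(5)]
patches = point_patches_sketch(image, pwidth=1, points=[(2, 2), (1, 3)])
```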
uncoverml.predict module¶
-
uncoverml.predict.
cluster_analysis
(x, y, partition_no, config, feature_names)¶
- Parameters
x (ndarray) – array of dim (Ns, d)
y (ndarray) – array of predictions of dimension (Ns, 1)
partition_no (int) – partition number of the image
config (config object) –
feature_names (list) – list of strings corresponding to ordered feature names
-
uncoverml.predict.
div0
(a, b)¶ Divide a by b, returning 0 wherever the divisor is 0, e.g. div0( [-1, 0, 1], 0 ) -> [0, 0, 0]
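The behaviour in that one-line description can be mimicked on plain lists. This is a sketch of the semantics only (the name div0_sketch is hypothetical, and the library function presumably operates on NumPy arrays):

```python
def div0_sketch(a, b):
    """Element-wise a / b, with 0.0 wherever the divisor is zero."""
    # broadcast a scalar divisor across all elements of a
    bs = b if isinstance(b, (list, tuple)) else [b] * len(a)
    return [x / d if d != 0 else 0.0 for x, d in zip(a, bs)]


zeros = div0_sketch([-1, 0, 1], 0)
mixed = div0_sketch([2.0, 4.0, 6.0], [2, 0, 3])
```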
-
uncoverml.predict.
final_cluster_analysis
(n_classes, n_paritions)¶
-
uncoverml.predict.
predict
(data, model, interval=0.95, **kwargs)¶
-
uncoverml.predict.
render_partition
(model, subchunk, image_out, config)¶
-
uncoverml.predict.
shapefile_prediction
(config, model)¶
-
uncoverml.predict.
write_mean_and_sd
(x, y, writer, config)¶
uncoverml.resampling module¶
Module for shapefile resampling methods. This code was originally developed by Sudipta Basak. (https://github.com/basaks)
See uncoverml.scripts.shiftmap_cli for a resampling CLI.
-
uncoverml.resampling.
bootstrap_data_indicies
(population, samples=None, random_state=1)¶
-
uncoverml.resampling.
filter_fields
(fields_to_keep, gdf)¶
-
uncoverml.resampling.
prepapre_dataframe
(data, fields_to_keep)¶
-
uncoverml.resampling.
resample_by_magnitude
(input_data, target_field, bins=10, interval='percentile', fields_to_keep=[], bootstrap=True, output_samples=None, validation=False, validation_points=100)¶
- Parameters
input_gdf (geopandas.GeoDataFrame) – Geopandas dataframe containing targets to be resampled.
target_field (str) – target field name based on which resampling is performed. Field must exist in the input_shapefile
bins (int) – number of bins for sampling
fields_to_keep (list) – of strings to store in the output shapefile
bootstrap (bool, optional) – whether to sample with replacement or not
output_samples (int, optional) – number of samples in the output shapefile. If not provided, the output samples will be assumed to be the same as the original shapefile
validation (bool, optional) – validation file name
validation_points (int, optional) – approximate number of points in the validation shapefile
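One way to picture resampling by magnitude is to sort the target values, cut them into equal-count (percentile) bins, and draw the same number of samples from every bin with replacement. The sketch below does exactly that on a plain list; it is an illustration of the general technique under those assumptions, with a hypothetical function name and a seed argument added for repeatability, and is not a transcription of the library routine.

```python
import random


def resample_by_magnitude_sketch(values, bins=10, output_samples=None, seed=1):
    """Return indices into values, drawn evenly from percentile bins."""
    rng = random.Random(seed)
    n = len(values)
    output_samples = output_samples or n
    # indices of the samples, ordered by target magnitude
    order = sorted(range(n), key=lambda i: values[i])
    # equal-count groups of sorted indices act as percentile bins
    groups = [order[b * n // bins:(b + 1) * n // bins] for b in range(bins)]
    groups = [g for g in groups if g]
    per_bin = output_samples // len(groups)
    picked = []
    for g in groups:
        # bootstrap: sample with replacement inside each bin
        picked.extend(rng.choices(g, k=per_bin))
    return picked


values = [float(v) for v in range(100)]
picked = resample_by_magnitude_sketch(values, bins=10, output_samples=50)
```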
-
uncoverml.resampling.
resample_spatially
(input_data, target_field, rows=10, cols=10, fields_to_keep=[], bootstrap=True, output_samples=None, validation_points=100)¶
- Parameters
input_shapefile –
output_shapefile –
target_field (str) – target field name based on which resampling is performed. Field must exist in the input_shapefile
rows (int, optional) – number of bins in y
cols (int, optional) – number of bins in x
fields_to_keep (list) – list of strings to store in the output shapefile
bootstrap (bool, optional) – whether to sample with replacement or not
output_samples (int, optional) – number of samples in the output shapefile. If not provided, the output samples will be assumed to be the same as the original shapefile
validation_points (int, optional) – approximate number of points in the validation shapefile
- Returns
- Return type
output_shapefile name
uncoverml.targets module¶
-
class
uncoverml.targets.
Targets
(lonlat, vals, othervals=None)¶ Bases:
object
-
classmethod
from_geodataframe
(gdf, observations_field='observations')¶ Returns a Targets object from a geopandas dataframe. One column will be taken as the main ‘observations’ field. All remaining non-geometry columns will be stored in the fields property.
- Parameters
observations_field (str) – Name of the column in the dataframe that is the main target observation (the field to train on).
- Returns
- Return type
-
to_geodataframe
()¶ Returns a copy of the targets as a geopandas dataframe.
- Returns
- Return type
geopandas.GeoDataFrame
-
classmethod
-
uncoverml.targets.
gather_targets
(targets, keep, node=None)¶
-
uncoverml.targets.
gather_targets_main
(targets, keep, node)¶
-
uncoverml.targets.
generate_covariate_shift_targets
(targets, bounds, n_points)¶
-
uncoverml.targets.
generate_dummy_targets
(bounds, label, n_points, field_keys=[], seed=1)¶ Generate dummy points with randomly generated positions. Points are generated on node 0 and distributed to other nodes if running in parallel.
- Parameters
bounds (tuple of float) – Bounding box to generate targets within, of format (xmin, ymin, xmax, ymax).
label (str) – Label to assign generated targets.
n_points (int) – Number of points to generate
field_keys (list of str, optional) – List of keys to add to fields property.
seed (int, optional) – Random number generator seed.
- Returns
A collection of randomly generated targets.
- Return type
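The position-generation step can be sketched in a few lines. The example covers only the seeded random placement inside the (xmin, ymin, xmax, ymax) bounds; labelling, the fields property and distribution to other MPI nodes are left out, and the function name is hypothetical.

```python
import random


def generate_dummy_points(bounds, n_points, seed=1):
    """Uniformly random (x, y) positions inside a bounding box."""
    xmin, ymin, xmax, ymax = bounds
    rng = random.Random(seed)
    return [
        (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        for _ in range(n_points)
    ]


points = generate_dummy_points((110.0, -45.0, 155.0, -10.0), n_points=20)
same_points = generate_dummy_points((110.0, -45.0, 155.0, -10.0), n_points=20)
```

Fixing the seed makes the generated positions repeatable between runs, which matters when results are compared across jobs.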
-
uncoverml.targets.
label_targets
(targets, label, backup_field=None)¶ Replaces target observations (the target property being trained on) with the given label.
-
uncoverml.targets.
merge_targets
(a, b)¶ Merges two Targets objects. They will be sorted in the canonical uncover-ml way: lexically by position (y, x).
- Parameters
a (Target) – The Targets to merge.
b (Target) – The Targets to merge.
- Returns
A single merged collection of targets.
- Return type
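The ordering rule can be illustrated on bare coordinate tuples. The sketch reads “lexically by position (y, x)” as: sort by y first, then by x to break ties. That reading, the (x, y) tuple layout and the function name are assumptions of this example.

```python
def merge_positions_sketch(a, b):
    """Concatenate two lists of (x, y) positions and sort by (y, x)."""
    return sorted(a + b, key=lambda p: (p[1], p[0]))


merged = merge_positions_sketch([(2, 0), (0, 1)], [(1, 0), (5, -1)])
```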
-
uncoverml.targets.
save_dropped_targets
(config, keep, targets)¶
-
uncoverml.targets.
save_targets
(targets, path, obs_filter=None)¶ Saves target positions and observation data to a CSV file.
- Parameters
targets (Targets) – The targets to save.
path (str) – Path to file to save as.
obs_filter (any, optional) – If provided, will only save points that have this observation data.
uncoverml.validate module¶
Scripts for validation
-
class
uncoverml.validate.
CrossvalInfo
(scores, y_true, y_pred, classification, positions)¶ Bases:
object
-
export_crossval
(config)¶ Exports a CSV file containing real target values and their corresponding predicted value generated as part of cross-validation.
Also populates the ‘prediction’ column of the ‘rawcovariates’ CSV file.
If enabled, the real vs predicted values will be plotted.
- Parameters
config (Config) – Uncover-ml config object.
-
-
class
uncoverml.validate.
OOSInfo
(scores, y_true, y_pred, classification, positions)¶ Bases:
uncoverml.validate.CrossvalInfo
-
export_scores
(config)¶
-
-
uncoverml.validate.
adjusted_r2_score
(r2, n_samples, n_covariates)¶
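This function is undocumented here. For orientation, the textbook adjusted R² penalises R² for the number of covariates as shown in the sketch below; whether the library uses exactly this form is an assumption.

```python
def adjusted_r2_sketch(r2, n_samples, n_covariates):
    """Textbook adjusted R^2: 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_covariates - 1)


adj = adjusted_r2_sketch(0.9, n_samples=101, n_covariates=10)
```

With 101 samples and 10 covariates, an R² of 0.9 is adjusted down to roughly 0.889.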
-
uncoverml.validate.
classification_validation_scores
(ys, eys, pys)¶ Calculates the validation scores for a classification prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:
accuracy
log_loss
f1
- Parameters
ys (numpy.array) – The test data outputs, one-hot representation
eys (numpy.array) – The (hard) predictions made by the trained model on test data, one-hot representation
pys (numpy.array) – The probabilistic predictions made by the trained model on test data
- Returns
scores – A dictionary containing all of the evaluated scores.
- Return type
dict
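Two of the listed metrics are simple enough to write out by hand, which also shows how the three inputs differ. The sketch below computes accuracy from the hard one-hot predictions and log loss from the probabilistic ones; the helper names and the clipping constant eps are choices made for this example.

```python
import math


def accuracy_sketch(ys, eys):
    """Fraction of samples whose one-hot prediction matches the truth."""
    hits = sum(1 for y, e in zip(ys, eys) if y == e)
    return hits / len(ys)


def log_loss_sketch(ys, pys, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(ys, pys):
        # clip probabilities so that log(0) cannot occur
        total -= sum(
            yi * math.log(min(max(pi, eps), 1.0 - eps))
            for yi, pi in zip(y, p)
        )
    return total / len(ys)


ys = [[1, 0], [0, 1], [1, 0]]                  # true classes, one-hot
eys = [[1, 0], [0, 1], [0, 1]]                 # hard predictions, one-hot
pys = [[0.8, 0.2], [0.1, 0.9], [0.4, 0.6]]     # predicted probabilities
scores = {
    "accuracy": accuracy_sketch(ys, eys),
    "log_loss": log_loss_sketch(ys, pys),
}
```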
-
uncoverml.validate.
local_crossval
(x_all, targets_all, config)¶ Performs K-fold cross validation to test the applicability of a model. Given a set of inputs and outputs, this function will evaluate the effectiveness of a model at predicting the targets, by splitting all of the known data. A model is trained on a subset of the total data and then used to predict all of the unseen targets; its performance on them provides a benchmark for evaluating the effectiveness of the model.
- Parameters
x_all (numpy.array) – A 2D array containing all of the training inputs
targets_all (numpy.array) – A 1D vector containing all of the training outputs
config (dict) – The global config object, which is used to choose the model to train.
- Returns
result – A dictionary containing all of the cross validation metrics, evaluated on the unseen data subset.
- Return type
dict
-
uncoverml.validate.
local_rank_features
(image_chunk_sets, transform_sets, targets, config)¶ Ranks the importance of the features based on their performance. This function trains and cross-validates a model with each individual feature removed and then measures the performance of the model with that feature removed. The most important feature is the one which, when removed, causes the greatest degradation in the performance of the model.
- Parameters
image_chunk_sets (dict) – A dictionary used to get the set of images to test on.
transform_sets (list) – A dictionary containing the applied transformations
targets (instance of geoio.Targets class) – The targets used in the cross validation
config (config class instance) – The global config file
-
uncoverml.validate.
out_of_sample_validation
(model, targets, features, config)¶
-
uncoverml.validate.
permutation_importance
(model, x_all, targets_all, config)¶
-
uncoverml.validate.
regression_validation_scores
(y, ey, n_covariates, model)¶ Calculates the validation scores for a regression prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:
r2_score
expvar
smse
lins_ccc
mll
- Parameters
y (numpy.array) – The test data outputs
ey (numpy.array) – The predictions made by the trained model on test data
n_covariates (int) – The number of covariates being used.
- Returns
scores – A dictionary containing all of the evaluated scores.
- Return type
dict
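For readers unfamiliar with the less common keys, the sketch below gives the usual definitions of three of them: r2_score, smse (mean squared error divided by the variance of the true targets) and lins_ccc (Lin's concordance correlation coefficient). These are the standard formulas written with the standard library; the exact variants the library computes, for example population versus sample variance, are assumptions.

```python
from statistics import fmean, pvariance


def r2_sketch(y, ey):
    """Coefficient of determination."""
    my = fmean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, ey))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def smse_sketch(y, ey):
    """Standardised mean squared error: MSE / Var(y)."""
    mse = fmean([(a - b) ** 2 for a, b in zip(y, ey)])
    return mse / pvariance(y)


def lins_ccc_sketch(y, ey):
    """Lin's concordance correlation coefficient."""
    my, me = fmean(y), fmean(ey)
    cov = fmean([(a - my) * (b - me) for a, b in zip(y, ey)])
    return 2.0 * cov / (pvariance(y) + pvariance(ey) + (my - me) ** 2)


y = [1.0, 2.0, 3.0, 4.0]
ey = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated, but biased by +1
scores = {
    "r2_score": r2_sketch(y, ey),
    "smse": smse_sketch(y, ey),
    "lins_ccc": lins_ccc_sketch(y, ey),
}
```

The biased predictions are perfectly correlated with the targets, yet lins_ccc falls well below 1, which is the property that distinguishes it from a plain correlation coefficient.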
-
uncoverml.validate.
split_cfold
(nsamples, k=5, seed=None)¶ Function that returns indices for splitting data into random folds.
- Parameters
nsamples (int) – the number of samples in the dataset
k (int, optional) – the number of folds
seed (int, optional) – random seed to provide to numpy
- Returns
cvinds (list) – list of k arrays of indices, each of approximate shape (nsamples / k,). The indices are a random permutation (without replacement) of the sample indices, partitioned into the folds.
cvassigns (ndarray) – array of shape (nsamples,) with each element in [0, k), that can be used to assign data to a fold. This corresponds to the indices of cvinds.
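The relationship between the two return values is shown in the following sketch, which uses plain lists in place of NumPy arrays and Python's random module in place of NumPy's generator. The function name and the contiguous slicing of the permutation are assumptions of this example.

```python
import random


def split_cfold_sketch(nsamples, k=5, seed=None):
    """Randomly partition range(nsamples) into k folds."""
    rng = random.Random(seed)
    perm = list(range(nsamples))
    rng.shuffle(perm)
    # contiguous slices of the permutation, sizes differing by at most one
    cvinds = [
        perm[f * nsamples // k:(f + 1) * nsamples // k] for f in range(k)
    ]
    # inverse mapping: fold number for every sample index
    cvassigns = [0] * nsamples
    for fold, inds in enumerate(cvinds):
        for i in inds:
            cvassigns[i] = fold
    return cvinds, cvassigns


cvinds, cvassigns = split_cfold_sketch(10, k=3, seed=0)
```

In a cross-validation loop, fold f is then held out by selecting the samples where cvassigns equals f and training on the rest.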