uncoverml package

Submodules

uncoverml.cluster module

class uncoverml.cluster.KMeans(k, oversample_factor)

Bases: object

Model object implementing learn and predict with K-means

Parameters
  • k (int > 0) – The number of classes to cluster the data into

  • oversample_factor (int > 1) – Controls the number of samples drawn as part of [1] in the initialisation step. More MPI nodes will increase the total number of points. Consider a value of 1 for more than about 16 nodes

References

1

Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. “Scalable k-means++.” Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633.
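A minimal usage sketch (data and argument values are illustrative; uncoverml normally drives this under MPI, but a single process with mpi4py installed should also work, and the exact columns returned by predict follow get_predict_tags):

import numpy as np
from uncoverml.cluster import KMeans

X = np.random.rand(1000, 4)               # (n_samples, n_dimensions) features
model = KMeans(k=5, oversample_factor=2)
model.learn(X)                            # unsupervised: no indices/classes
classes = model.predict(X)                # cluster assignment per sample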

get_predict_tags()
learn(x, indices=None, classes=None)

Find the cluster centres using k-means||

Parameters
  • x (ndarray) – (n_samples, n_dimensions) length array containing the training samples to cluster

  • indices (ndarray) – (n_samples) length integer array giving the locations in x where labels exist

  • classes (ndarray) – (n_samples) length integer array giving the class assignments of points in x in locations given by indices

predict(x, *args, **kwargs)
class uncoverml.cluster.TrainingData(indices, classes)

Bases: object

Light wrapper for the indices and values of training data

Parameters
  • indices (ndarray) – length N array of the indices of the input data that have classes assigned

  • classes (ndarray) – length N int array of the class values at locations specified by indices

uncoverml.cluster.centroid(X, weights=None)

Compute the centroid of a set of points X

The points X may have repetitions given by the weights.

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points

  • weights (ndarray (optional)) – (n,) array of weights giving the repetition (or mass) of each point in X

Returns

centroid – (d,) length array, the d-dimensional centroid point of all x in X.

Return type

ndarray
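Locally (ignoring the MPI reduction the real function performs), the weighted centroid is just a weighted mean. A sketch:

import numpy as np

def centroid_sketch(X, weights=None):
    # weighted mean over the n points; weights=None gives the plain mean
    return np.average(X, axis=0, weights=weights)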

uncoverml.cluster.compute_class(X, C, training_data=None)

Find the closest cluster centre for each x in X

This returns which cluster centre each x in X belongs to, with optional semi-supervised training data that will force an assignment of a point to a particular class

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points to be evaluated

  • C (ndarray) – (k, d) array of cluster centres, associated with classes 0..k-1

  • training_data (TrainingData (optional)) – instance of TrainingData containing fixed class assignments for particular points

Returns

  • classes (ndarray) – (n,) int array of class assignments (0..k-1) for each x in X

  • cost (float) – The total ‘cost’ of the assignment, which is the average distance of all points to their assigned centre
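Ignoring the distributed execution and the semi-supervised overrides, the core assignment can be sketched as:

import numpy as np
from scipy.spatial.distance import cdist

def compute_class_sketch(X, C):
    D2 = cdist(X, C, 'sqeuclidean')             # (n, k) squared distances
    classes = np.argmin(D2, axis=1)             # nearest centre per point
    cost = np.mean(np.sqrt(D2[np.arange(len(X)), classes]))
    return classes, cost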

uncoverml.cluster.compute_n_classes(classes, config)

The number of cluster centres to use for K-means

Just handles the case where someone specifies k=5 but labels 10 classes in the training data. This will return k=10.

Parameters
  • classes (ndarray) – an array of hard class assignments given as training data

  • config (Config) – The app config class holding the number of classes asked for

Returns

k – The max of k and the number of classes referenced in the training data

Return type

int > 0

uncoverml.cluster.compute_weights(x, C)

Number of points in x assigned to each centre c in C

Parameters
  • x (ndarray) – (n, d) array of n d-dimensional points

  • C (ndarray) – (k, d) array of k cluster centres

Returns

weights – (k,) length array giving number of x closest to each c in C

Return type

ndarray

uncoverml.cluster.initialise_centres(X, k, l, training_data=None, max_iterations=1000)

Use Kmeans|| to find initial cluster centres

This algorithm efficiently generates log(n) candidate samples, then uses k-means to cluster them into k initial starting centres used in the main algorithm (clustering X)

Parameters
  • X (ndarray) – (n,d) array of points to cluster

  • k (int > 0) – number of clusters

  • l (float > 0) – Oversample factor. See weighted_starting_candidates.

  • training_data (TrainingData (optional)) – Optional hard assignments of certain points in X

  • max_iterations (int > 0) – The algorithm will terminate after this many iterations even if it hasn’t converged.

Returns

C_init – (k, d) array of starting cluster centres for clustering X with k-means.

Return type

ndarray

uncoverml.cluster.kmean_distance2(x, C)

Compute squared euclidian distance to the nearest cluster centre

Parameters
  • x (ndarray) – (n, d) array of n d-dimensional points

  • C (ndarray) – (k, d) array of k cluster centres

Returns

d2_x – (n,) length array of distances from each x to the nearest centre

Return type

ndarray

uncoverml.cluster.kmeans_step(X, C, classes, weights=None)

A single step of the k-means algorithm.

Assigns every point in X a centre, then computes the centroid of all x assigned to each centre, then updates that centre to be the new centroid.

Parameters
  • X (ndarray) – (n, d) array of points to be clustered

  • C (ndarray) – (k, d) array of initial cluster centres

  • classes (ndarray) – (n,) array of initial class assignments

  • weights (ndarray (optional)) – weights for points x in X that allow for different ‘masses’ or repetitions in the centroid calculation

Returns

C_new – (k, d) array of new cluster centres

Return type

ndarray
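A local, non-distributed sketch of one step (note the real implementation re-seeds any cluster that loses all its members via reseed_point; this sketch does not):

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_step_sketch(X, C, weights=None):
    classes = np.argmin(cdist(X, C, 'sqeuclidean'), axis=1)
    C_new = np.empty_like(C)
    for j in range(C.shape[0]):
        members = classes == j                  # points assigned to centre j
        w = None if weights is None else weights[members]
        C_new[j] = np.average(X[members], axis=0, weights=w)
    return C_new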

uncoverml.cluster.log = <Logger uncoverml.cluster (WARNING)>

Never use more than this many x’s to compute a distance matrix (save memory!)

uncoverml.cluster.reseed_point(X, C, index)

Re-initialise the centre of a class if it loses all its members

This should almost never happen. If it does, find the point furthest from all the other cluster centres and use that. Maybe a bad idea but a decent first pass

Parameters
  • X (ndarray) – (n, d) array of points

  • C (ndarray) – (k, d) array of cluster centres

  • index (int >= 0) – index between 0..k-1 of the cluster that has lost its points

Returns

new_point – d-dimensional point for replacing the empty cluster centre.

Return type

ndarray

uncoverml.cluster.run_kmeans(X, C, k, weights=None, training_data=None, max_iterations=1000)

Cluster points into k clusters using K-means

This is a distributed implementation of Lloyd’s algorithm, which iteratively optimises the point assignments and cluster centres until it reaches a locally optimal solution. The result depends heavily on the initial cluster centres C

Parameters
  • X (ndarray) – (n, d) array n d-dimensional of points to cluster

  • C (ndarray) – (k, d) array of initial cluster centres

  • k (int > 0) – number of clusters

  • weights (ndarray (optional)) – (n,) array of optional repetition weights for points in X. A weight of 2 implies there are 2 points at that location

  • training_data (TrainingData (optional)) – An instance of the TrainingData class containing fixed cluster assignments for some of the x in X

  • max_iterations (int > 0 (optional)) – The algorithm will return after this many iterations, even if it hasn’t converged

Returns

  • C (ndarray) – (k, d) array of final cluster centres, ordered (0..k-1)

  • classes (ndarray) – (n,) array of class assignments (0..k-1) for each x in X
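Putting the pieces together, a hedged end-to-end sketch (data and values illustrative):

import numpy as np
from uncoverml import cluster

X = np.random.rand(1000, 4)
k, l = 5, 2.0
C_init = cluster.initialise_centres(X, k, l)
C, classes = cluster.run_kmeans(X, C_init, k)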

uncoverml.cluster.sum_axis_0(x, y, dtype)

Reduce operation that sums 2 arrays on axis zero

uncoverml.cluster.weighted_starting_candidates(X, k, l)

Generate (weighted) candidates to initialise the full k-means

See the kmeans|| algorithm/paper for details. The goal is to find points that are good starting cluster centres for a full kmeans using only log(n) passes through the data

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points to be clustered

  • k (int > 0) – number of clusters

  • l (float > 0) – The ‘oversample factor’ that controls how many candidates are found. Candidates are found independently on each node, so this can be smaller when the computation uses more nodes.

Returns

  • w (ndarray) – The ‘weights’ of the cluster centres, which are the number of points in X closest to each centre

  • C (ndarray) – The cluster centres themselves. The total number of candidates is not known beforehand, so the array will be shaped (z, d) where z is some number that increases with l.

uncoverml.config module

Handles parsing of the configuration file.

class uncoverml.config.Config(yaml_file, clustering=False, learning=False, resampling=False, predicting=False, shiftmap=True)

Bases: object

Class representing the global configuration of the uncoverml scripts.

This class is mostly read-only, but it does also contain the Transform objects which have state. In some execution paths, config flags are switched off then back on (e.g. in cross validation).

Along with the YAML file, the init also takes some flags. These are set by the top-level CLI scripts and are used to determine what parameters to load and what can be ignored.

All attributes following output_dir (located at the bottom of init) are undocumented but should be self-explanatory. They are full paths to output for different features.

Todo

Factor out stateful Transform objects.

Parameters
  • yaml_file (str) – The path to the yaml config file.

  • clustering (bool) – True if clustering.

  • learning (bool) – True if learning.

  • resampling (bool) – True if resampling.

  • predicting (bool) – True if predicting.
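A hedged instantiation sketch (‘demo.yaml’ is a hypothetical config path):

from uncoverml.config import Config

conf = Config('demo.yaml', learning=True)
print(conf.algorithm, conf.output_dir)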

name

Name of the config file.

Type

str

algorithm

Name of the model to train. See Models for available models.

Type

str

algorithm_args

A dictionary of arguments to pass to selected model. See Models for available arguments to model. Key is the argument name exactly as it appears in model __init__ (this dict gets passed as kwargs).

Type

dict(str, any)

cubist

True if cubist algorithm is being used.

Type

bool

multicubist

True if multicubist algorithm is being used.

Type

bool

multirandomforest

True if multirandomforest algorithm is being used.

Type

bool

krige

True if kriging is being used.

Type

bool

bootstrap

True if a bootstrapped algorithm is being used.

Type

bool

clustering

True if clustering is being performed.

Type

bool

n_classes

Number of classes to cluster into. Required if clustering.

Type

int

oversample_factor

Controls how many candidates are found for cluster initialisation when running kmeans clustering. See weighted_starting_candidates(). Required when clustering.

Type

float

cluster_analysis

True if analysis should be performed post-clustering. Optional, default is False.

Type

bool, optional

class_file

Define classes for clustering feature data. Path to shapefile that defines class at positions.

Type

str or bytes, optional

semi_supervised

True if semi_supervised clustering is being performed (i.e. class_file has been provided).

Type

bool

target_search

True if the target_search feature is being used.

Type

bool

target_search_threshold

Target search threshold, float between 0 and 1. The likelihood a training point must surpass to be included in found points.

Type

float

target_search_extents

A bounding box defining the image area to search for additional targets.

Type

tuple(float, float, float, float)

tse_are_pixel_coordinates

If True, target_search_extents are treated as pixel coordinates instead of CRS coordinates.

Type

bool

extents

A bounding box defining the area to learn and predict on. Data outside these extents gets cropped. Optional, if not provided whole image area is used.

Type

tuple(float, float, float, float), optional

extents_are_pixel_coordinates

If True, extents are treated as pixel coordinates instead of CRS coordinates.

Type

bool

pk_covariates

Path to where to save pickled covariates, or a pre-existing covariate pickle file if loading pickled covariates.

Type

str or bytes

pk_targets

Path to where to save pickled targets, or a pre-existing target pickle file if loading pickled targets.

Type

str or bytes

pk_load

True if both pk_covariates and pk_targets are provided and these paths exist (it’s assumed they contain the correct pickled data).

Type

bool

feature_sets

The provided features as FeatureSetConfig objects. These contain paths to the feature files and importantly the Transform objects which contain statistics used to transform the covariates. These Transform objects and contained statistics must be maintained across workflow steps (aka CLI commands).

Type

list of FeatureSetConfig

patchsize

Half-width of the patches that feature data will be chunked into. Height/width of each patch is equal to patchsize * 2 + 1.

Todo

Not implemented, defaults to 1.

Type

int

target_file

Path to a shapefile defining the targets to be trained on.

Type

str or bytes

target_property

Name of the field in the target_file to be used as training property.

Type

str

target_weight_property

Name of the field in the target_file to be used as target weights.

Type

str, optional

fields_to_write_to_csv

List of field names in the target_file to be included in output table.

Type

list(str), optional

shiftmap_targets

Path to a shapefile containing targets to generate the shiftmap from. This is optional; by default, shiftmap will generate dummy targets by randomly sampling the target shapefile.

Type

str or bytes, optional

spatial_resampling_args

Kwargs for spatial resampling. See Resampling for more details.

Type

dict

value_resampling_args

Kwargs for value resampling. See Resampling for more details.

Type

dict

final_transform

Transforms to apply to whole image set after other preprocessing has been performed.

Type

TransformSet

oos_percentage

Float between 0 and 1. The percentage of targets to withhold from training to be used in out-of-sample validation.

Type

float, optional

oos_shapefile

Shapefile containing targets to be used in out-of-sample validation.

Type

str or bytes, optional

oos_property

Name of the property in oos_shapefile to be used in validation. Only required if an OOS shapefile is provided.

Type

str

out_of_sample_validation

True if out of sample validation is to be performed.

Type

bool

rank_features

True if ‘feature_ranking’ is True in ‘validation’ block of the config. Turns on feature ranking. Default is False.

Type

bool, optional

permutation_importance

True if ‘permutation_importance’ is True in ‘validation’ block of the config. Turns on permutation importance. Default is False.

Type

bool

parallel_validate

True if ‘parallel’ is present in ‘k-fold’ block of config. Turns on parallel k-fold cross validation. Default is False.

Type

bool, optional

cross_validate

True if ‘k-fold’ block is present in ‘validation’ block of config. Turns on k-fold cross validation.

Type

bool, optional

folds

The number of folds to split dataset into for cross validation. Required if cross_validate is True.

Type

int

crossval_seed

Seed for random sorting of folds for cross validation. Required if cross_validate is True.

Type

int

optimisation

Dictionary of optimisation arguments. See Optimisation for details.

Type

dict

geotiff_options

Optional creation options passed to the geotiff output driver. See https://gdal.org/drivers/raster/gtiff.html#creation-options for a list of creation options.

Type

dict, optional

quantiles

Prediction quantile/interval for predicted values.

Type

float

outbands

The outbands to write in the prediction output file. Used as the ‘stop’ for a slice taken from the list of prediction tags, i.e. [0: outbands]. If the resulting slice is greater than the number of tags available, then all tags will be selected. If no value is provided, then all tags will be selected.

Todo

Having this as a slice is questionable. Should be simplified.

Type

int

thumbnails

Subsampling factor for thumbnails of output images. Default is 10.

Type

int, optional

bootstrap_predictions

Only applies if a bootstrapped algorithm is being used. This is the number of predictions to perform; by default it will predict on all sub-models. E.g. if you had a bootstrapped algorithm containing 100 sub-models, you could limit a test prediction to 20 using this parameter to speed things up.

Type

int, optional

mask

Path to a geotiff file for masking the output prediction map. Only cells marked with the retain value will be predicted.

Type

str, optional

retain

Value in the above mask that indicates a cell should be retained and predicted. Must be provided if a mask is provided.

Type

int

lon_lat

Dictionary containing paths to longitude and latitude grids used in kriging.

Type

dict, optional

output_dir

Path to directory where prediction map and other outputs will be written.

Type

str

static parse_extents(exb)

Validates extents parameters.

set_algo_flags()

Convenience method for setting boolean flags based on the algorithm being used.

property tmpdir

Convenience property for creating the tmpdir needed by some UncoverML functionality.

yaml_loader

alias of yaml.loader.SafeLoader

exception uncoverml.config.ConfigException

Bases: Exception

class uncoverml.config.FeatureSetConfig(config_dict)

Bases: object

Config class representing a ‘feature set’ in the config file.

Parameters

config_dict (dict) – The section of the yaml file for a feature set.

name

Name of the feature set.

Type

str

type

Data type of the feature set (‘categorical’ or ‘ordinal’).

Type

str

files

Absolute paths to .tif files of the feature set.

Type

list of str

transform_set

Transforms specified for the feature set.

Type

ImageTransformSet

uncoverml.cubist module

class uncoverml.cubist.Cubist(name='temp', print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, sampling=None, seed=None, neighbors=None, feature_type=None, composite_model=False, auto=False, extrapolation=None, calc_usage=False, bootstrap=None)

Bases: object

This class wraps the cubist command line tools in a scikit-learn interface. The learning phase relies on the cubist command line tools, whereas the predictions themselves are executed directly in python.

fit(x, y)

Train the Cubist model. Given a matrix of values (X) and an output vector of values (y), this method will train the Cubist model and then read the training files directly as parameters of this class.

Parameters
  • x (numpy.array) – X contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, where n is the number of available training points.

  • y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.

predict(x)

Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.

Parameters

x (numpy.array) – The inputs for which the model should be evaluated

Returns

y_mean – An array of expected output values given the inputs

Return type

numpy.array

predict_dist(x, interval=0.95)

Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in X. This method also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.

This method also outputs quantile information along with the variance to establish the probability distribution clearly.

Parameters
  • x (numpy.array) – The inputs for which the model should be evaluated

  • interval (float) – The probability threshold for which the quantiles should be output.

Returns

  • y_mean (numpy.array) – An array of expected output values given the inputs

  • y_var (numpy.array) – The variance of the outputs

  • ql (numpy.array) – The lower quantiles for each input

  • qu (numpy.array) – The upper quantiles for each input
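A usage sketch, assuming the Cubist command line tools are installed and on the PATH (data illustrative):

import numpy as np
from uncoverml.cubist import Cubist

x = np.random.rand(100, 3)
y = np.random.rand(100)
model = Cubist(committee_members=5)
model.fit(x, y)
y_mean, y_var, ql, qu = model.predict_dist(x, interval=0.95)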

class uncoverml.cubist.CubistReportRow(cond, model, feature)

Bases: object

Convenience class for accumulating the Cubist report.

class uncoverml.cubist.MultiCubist(outdir='.', trees=10, print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, neighbors=None, feature_type=None, sampling=70, seed=None, extrapolation=None, composite_model=False, auto=False, parallel=False, calc_usage=False, bootstrap=None)

Bases: object

This is a wrapper on Cubist.

calculate_usage()

Averages the Cond and Model statistics of all the cubist runs

fit(x, y)

Train the Cubist model. Given a matrix of values (X) and an output vector of values (y), this method will train the Cubist model and then read the training files directly as parameters of this class.

Parameters
  • x (numpy.array) – X contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, where n is the number of available training points.

  • y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.

predict(x)

Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.

Parameters

x (numpy.array) – The inputs for which the model should be evaluated

Returns

y_mean – An array of expected output values given the inputs

Return type

numpy.array

predict_dist(x, interval=0.95)

Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in X. This method also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.

This method also outputs quantile information along with the variance to establish the probability distribution clearly.

Parameters
  • x (numpy.array) – The inputs for which the model should be evaluated

  • interval (float) – The probability threshold for which the quantiles should be output.

Returns

  • y_mean (numpy.array) – An array of expected output values given the inputs

  • y_var (numpy.array) – The variance of the outputs

  • ql (numpy.array) – The lower quantiles for each input

  • qu (numpy.array) – The upper quantiles for each input

class uncoverml.cubist.Rule(rule, m)

Bases: object

comparator = {'<': <ufunc 'less'>, '<=': <ufunc 'less_equal'>, '=': <ufunc 'equal'>, '>': <ufunc 'greater'>, '>=': <ufunc 'greater_equal'>}
regress(x, mask=None)
satisfied(x)
uncoverml.cubist.arguments(p)
uncoverml.cubist.cond_line(line)
uncoverml.cubist.mean(numbers)
uncoverml.cubist.pairwise(iterable)
uncoverml.cubist.parse_float_array(arraystring)
uncoverml.cubist.read_data(filename)
uncoverml.cubist.remove_first_line(line)
uncoverml.cubist.save_data(filename, data)
uncoverml.cubist.variance_with_mean(mean)
uncoverml.cubist.write_dict(filename, dict_obj)

uncoverml.cubist_config module

uncoverml.diagnostics module

This module contains functionality for plotting validation scores and other diagnostic information.

uncoverml.diagnostics.plot_covariate_correlation(path, method='pearson')

Plots matrix of correlation between covariates.

Parameters
  • path (str) – Path to ‘rawcovariates’ CSV file.

  • method (str, optional) – Correlation coefficient to calculate. Choices are ‘pearson’, ‘kendall’, ‘spearman’. Default is ‘pearson’.

Returns

The matrix plot as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure
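A usage sketch (the CSV path is hypothetical; it is the ‘<config name>_rawcovariates.csv’ file written during learning):

from uncoverml import diagnostics

fig = diagnostics.plot_covariate_correlation('out/demo_rawcovariates.csv')
fig.savefig('covariate_correlation.png')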

uncoverml.diagnostics.plot_covariates_x_targets(path, cols=2, subplot_width=8, subplot_height=4)

Plots scatter plots of each covariate intersected with target values.

Parameters
  • path (str) – Path to ‘rawcovariates’ CSV file containing intersection of targets and covariates.

  • cols (int, optional) – The number of columns to split the figure into. Default is 2.

  • subplot_width (int) – Width of each subplot in inches. Default is 8.

  • subplot_height (int) – Height of each subplot in inches. Default is 4.

Returns

The scatter plots as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_feature_rank_curves(path, subplot_width=8, subplot_height=4)

Plots curves for feature ranking of each metric.

Parameters
  • path (str) – Path to ‘featureranks’ JSON file.

  • subplot_width (int, optional) – Width of each subplot. Default is 8.

  • subplot_height (int, optional) – Height of each subplot. Default is 4.

Returns

The plots as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_feature_ranks(path, barwidth=0.08, figsize=(15, 9))

Plots a grouped bar chart of feature rank scores, grouped by covariate. Depending on the number of covariates and metrics being calculated you may need to tweak barwidth and figsize so the bars fit.

Parameters
  • path (str) – Path to JSON file containing feature ranking results.

  • barwidth (float, optional) – Width of the bars.

  • figsize (tuple(float, float), optional) – The (width, height) of the figure in inches.

Returns

The bar chart as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_real_vs_pred_crossval(crossval_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)
uncoverml.diagnostics.plot_real_vs_pred_prediction(rc_path, pred_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)
uncoverml.diagnostics.plot_residual_error_crossval(crossval_path, bins=20)
uncoverml.diagnostics.plot_residual_error_prediction(rc_path, pred_path, bins=20)
uncoverml.diagnostics.plot_target_scaling(path, bins=20, title='Target Scaling', sharey=False)

Plots histograms of target values pre and post-scaling.

Parameters
  • path (str) – Path to ‘transformed_targets’ CSV file.

  • bins (int, optional) – The number of value bins for the histograms. Default is 20.

  • title (str, optional) – The title of the plot. Defaults to ‘Target Scaling’.

  • sharey (bool) – Whether the plots will share a y-axis and scale. Default is False.

Returns

The histograms as a matplotlib Figure.

Return type

obj:maplotlib.figure.Figure

uncoverml.features module

uncoverml.features.cull_all_null_rows(feature_sets)
uncoverml.features.extract_features(image_source, targets, n_subchunks, patchsize)

Each node gets its own share of the targets, so all nodes will always have targets.

uncoverml.features.extract_subchunks(image_source, subchunk_index, n_subchunks, patchsize)
uncoverml.features.features_from_shapefile(feature_sets, mask=None)
uncoverml.features.gather_features(x, node=None)
uncoverml.features.intersect_shapefile_features(targets, feature_sets, target_drop_values)

Extract covariates from a shapefile. This is done by intersecting targets with the shapefile. The shapefile must have the same number of rows as there are targets.

Drop target values here for tabular predictions. This is mainly for convenience if there are classes or points in the target file that we don’t want to predict on for whatever reason (e.g. out-of-sample validation purposes). It’s done here rather than when targets are first loaded so we don’t also have to handle a mask + targets being returned from target loading as the mask won’t be required in most situations.

Parameters
  • targets (uncoverml.targets.Targets) – An uncoverml.targets.Targets object that has been loaded from a shapefile.

  • feature_sets (list of uncoverml.config.FeatureSetConfig) – A list of feature sets of ‘tabular’ type (sourced from shapefiles). Each set must have an attribute file that points to the shapefile to load and attribute fields which is the list of fields to retrieve as covariates from the file.

  • target_drop_values (list of any) – A list of values; any row whose target observation equals one of these values is dropped and won’t be intersected with the covariates.

uncoverml.features.remove_missing(x, targets=None)
uncoverml.features.save_intersected_features_and_targets(feature_sets, transform_sets, targets, config, impute=True)

This function saves a table of covariate values and the target value intersected at each point. It also contains columns for UID ‘index’ and a predicted value.

If the target shapefile contains an ‘index’ field, this will be used to populate the ‘index’ column. This is intended to be used as a unique ID for each point in post-processing. If no ‘index’ field exists this column will be zero filled.

The ‘prediction’ column is for predicted values created during cross-validation. Again, this is for post-processing. It will only be populated if cross-validation is run later on. If not, it will be zero filled.

Two files will be output:

…/output_dir/{name_of_config}_rawcovariates.csv

…/output_dir/{name_of_config}_rawcovariates_mask.csv

This function will also optionally output intersected covariates scatter plot and covariate correlation matrix plot.

uncoverml.features.transform_features(feature_sets, transform_sets, final_transform, config)

uncoverml.filtering module

Code for computing the gamma sensor footprint, and for applying and unapplying spatial convolution filters to a given image.

BM: this is used in scripts/gammasensor_cli.py - I haven’t used it in my time with uncoverml or seen it used.

uncoverml.filtering.fwd_filter(img, S)
uncoverml.filtering.inv_filter(img, S, noise=0.001)
uncoverml.filtering.kernel_impute(img, S)
uncoverml.filtering.pad2(img)
uncoverml.filtering.sensor_footprint(img_w, img_h, res_x, res_y, height, mu_air)

uncoverml.geoio module

class uncoverml.geoio.ArrayImageSource(A, origin, crs, pixsize)

Bases: uncoverml.geoio.ImageSource

An image source that uses an internally stored numpy array

Parameters
  • A (MaskedArray) – masked array of shape (xpix, ypix, channels) that contains the image data.

  • origin (ndarray) – Array of the form [lonmin, latmin] that defines the origin of the image

  • pixsize (ndarray) – Array of the form [pixsize_x, pixsize_y] defining the size of a pixel

data(min_x, max_x, min_y, max_y)
class uncoverml.geoio.ImageSource

Bases: object

property crs
abstract data(min_x, max_x, min_y, max_y)
property dtype
property full_resolution
property nodata_value
property origin_latitude
property origin_longitude
property pixsize_x
property pixsize_y
class uncoverml.geoio.ImageWriter(shape, bbox, crs, n_subchunks, outpath, outbands, band_tags=None, independent=False, **kwargs)

Bases: object

close()
nodata_value = array(-1.e+20, dtype=float32)
output_thumbnails(ratio=10)
write(x, subchunk_index)
Parameters
  • x

  • subchunk_index

  • independent (bool) – Independent image writing by different processes, i.e., images are not chunked.


class uncoverml.geoio.RasterioImageSource(filename)

Bases: uncoverml.geoio.ImageSource

data(min_x, max_x, min_y, max_y)
uncoverml.geoio.SharedTrainingData

alias of uncoverml.geoio.TrainingData

uncoverml.geoio.create_shared_training_data(targets_all, x_all)
uncoverml.geoio.crop_covariates(config, outdir=None)

Crops the covariate files listed under config.feature_sets using the bounds provided under config.extents. The cropped covariates are stored in a temporary directory and the paths in config.feature_sets are redirected to these files. The caller is responsible for removing the files once they are no longer needed.

Parameters
  • config (uncoverml.config.Config) – Parsed UncoverML config.

  • outdir (str) – Absolute path to directory to store cropped covariates. If not provided, a tmp directory will be created.

uncoverml.geoio.crop_mask(config, outdir=None)

Crops the prediction mask listed under config.mask.

uncoverml.geoio.crop_tif(filename, extents, pixel_coordinates=False, outfile=None, strict=False)

Crops the geotiff using the provided extent.

Parameters
  • filename (str) – Path to the geotiff to be cropped.

  • extents (tuple(float, float, float, float)) – Bounding box to crop by, ordering is (xmin, ymin, xmax, ymax). Data outside bounds will be cropped. Any elements that are None will be substituted with the original bound of the geotiff.

  • outfile (str) – Path to save cropped geotiff. If not provided, will be saved with original name + random id in tmp directory.
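A usage sketch (paths and bounds are illustrative):

from uncoverml.geoio import crop_tif

# (xmin, ymin, xmax, ymax) in CRS coordinates; None keeps an original bound
crop_tif('covariate.tif', (130.0, -32.0, 132.0, -30.0),
         outfile='covariate_cropped.tif')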

uncoverml.geoio.deallocate_shared_training_data(training_data)
uncoverml.geoio.distribute_targets(positions, observations, fields)

Distributes a target object across all nodes

uncoverml.geoio.export_feature_ranks(measures, feats, scores, config)
uncoverml.geoio.export_model(model, config)
uncoverml.geoio.feature_names(config)
uncoverml.geoio.get_image_bounds(config)
uncoverml.geoio.get_image_crs(config)
uncoverml.geoio.get_image_pixel_res(config)
uncoverml.geoio.get_image_spec(model, config)
uncoverml.geoio.image_feature_sets(targets, config)
uncoverml.geoio.image_resolutions(config)
uncoverml.geoio.image_subchunks(subchunk_index, config)
uncoverml.geoio.load_shapefile(filename, targetfield, covariate_crs, extents)

TODO

uncoverml.geoio.load_targets(shapefile, targetfield=None, covariate_crs=None, extents=None)

Loads the shapefile onto node 0 then distributes it across all available nodes.

Important: here the concatenated targets get sorted on the root processor by position (Y,X). It’s important that this order is preserved. Once covariates are intersected with the target data, they are also in this order. This ordering is what keeps the target and feature arrays synced.

uncoverml.geoio.resample(input_tif, output_tif, ratio, resampling=5)
Parameters
  • input_tif (str or rasterio.io.DatasetReader) – input file path or rasterio.io.DatasetReader object

  • output_tif (str) – output file path

  • ratio (float) – ratio by which to shrink/expand ratio > 1 means shrink

  • resampling (int, optional) – default is 5 (average) resampling. Other options: nearest = 0, bilinear = 1, cubic = 2, cubic_spline = 3, lanczos = 4, average = 5, mode = 6, gauss = 7, max = 8, min = 9, med = 10, q1 = 11, q3 = 12.
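A usage sketch (paths illustrative):

from uncoverml.geoio import resample

# ratio=2 shrinks the image; resampling=5 is 'average'
resample('covariate.tif', 'covariate_small.tif', ratio=2, resampling=5)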

uncoverml.geoio.semisupervised_feature_sets(targets, config)
uncoverml.geoio.unsupervised_feature_sets(config)
uncoverml.geoio.write_shapefile_prediction(pred, pred_tags, positions, config)

uncoverml.image module

Contains class and routines for reading chunked portions of images.

class uncoverml.image.Image(source, chunk_idx=0, nchunks=1, overlap=0)

Bases: object

Represents a raster Image. Can be used to get a georeferenced chunk of an Image and the data associated with it. This class is mainly used in the features module for intersecting image chunks with target data and extracting the image data. It’s also used in geoio for getting covariate specs, such as CRS and bounds.

If nchunks > 1, then the Image is striped horizontally. chunk_idx 0 is the first strip of the image. The X range covers the full width of the image and the Y range runs from 0 to image_height / nchunks.

Parameters
  • source (ImageSource) – An instance of ImageSource (typically RasterioImageSource). Defines the image to be loaded.

  • chunk_idx (int) – Which chunk of the image is being loaded.

  • nchunks (int) – Total number of chunks being used. This is typically set by the partitions parameter of the top level command, also set as n_subchunks on the Config object.

  • overlap (int) – Number of rows to overlap with bounding strips, for accommodating overlap between chunks. Doesn’t appear to actually be used.
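A usage sketch (the file path is illustrative):

from uncoverml.geoio import RasterioImageSource
from uncoverml.image import Image

src = RasterioImageSource('covariate.tif')
img = Image(src, chunk_idx=0, nchunks=4)  # first of four horizontal strips
chunk = img.data()                        # masked array for this strip
print(img.xmin, img.ymin, img.xmax, img.ymax)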

property channels
data()
property dtype
in_bounds(lonlat)
lonlat2pix(lonlat)
property nodata_value
property npoints
patched_bbox(patchsize)
patched_shape(patchsize)
pix2lonlat(xy)
property x_range
property xmax
property xmin
property xres
property y_range
property ymax
property ymin
property yres
uncoverml.image.bbox2affine(xmax, xmin, ymax, ymin, xres, yres)
uncoverml.image.construct_splits(npixels, nchunks, overlap=0)

Splits the image horizontally into approximately equal strips according to npixels / nchunks.

uncoverml.interpolate module

class uncoverml.interpolate.SKLearnCT(fill_value=0, rescale=False, maxiter=1000, tol=0.0001)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.CloughTocher2DInterpolator class.

CloughTocher2DInterpolator(points, values, tol=1e-6)

Piecewise cubic, C1 smooth, curvature-minimizing interpolant in 2D.

New in version 0.9.

__call__()
Parameters
  • points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.

  • values (ndarray of float or complex, shape (npoints, ..)) – Data values.

  • fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is nan.

  • tol (float, optional) – Absolute/relative tolerance for gradient estimation.

  • maxiter (int, optional) – Maximum number of iterations in gradient estimation.

  • rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

Notes

The interpolant is constructed by triangulating the input data with Qhull [1]_, and constructing a piecewise cubic interpolating Bezier polynomial on each triangle, using a Clough-Tocher scheme [CT]. The interpolant is guaranteed to be continuously differentiable.

The gradients of the interpolant are chosen so that the curvature of the interpolating surface is approximatively minimized. The gradients necessary for this are estimated using the global algorithm described in [Nielson83,Renka84]_.

References

1

http://www.qhull.org/

CT

See, for example, P. Alfeld, ‘’A trivariate Clough-Tocher scheme for tetrahedral data’’. Computer Aided Geometric Design, 1, 169 (1984); G. Farin, ‘’Triangular Bernstein-Bezier patches’’. Computer Aided Geometric Design, 3, 83 (1986).

Nielson83

G. Nielson, ‘’A method for interpolating scattered data based upon a minimum norm network’’. Math. Comp., 40, 253 (1983).

Renka84

R. J. Renka and A. K. Cline. ‘’A Triangle-based C1 interpolation method.’’, Rocky Mountain J. Math., 14, 223 (1984).

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnLinearNDInterpolator(fill_value=0, rescale=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.LinearNDInterpolator class.

LinearNDInterpolator(points, values, fill_value=np.nan, rescale=False)

Piecewise linear interpolant in N dimensions.

New in version 0.9.

__call__()
Parameters
  • points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.

  • values (ndarray of float or complex, shape (npoints, ..)) – Data values.

  • fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is nan.

  • rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

Notes

The interpolant is constructed by triangulating the input data with Qhull [1]_, and on each triangle performing linear barycentric interpolation.

References

1

http://www.qhull.org/

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnNearestNDInterpolator(rescale=False, tree_options=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.NearestNDInterpolator class.

NearestNDInterpolator(x, y)

Nearest-neighbour interpolation in N dimensions.

New in version 0.9.

__call__()
Parameters
  • x ((Npoints, Ndims) ndarray of floats) – Data point coordinates.

  • y ((Npoints,) ndarray of float or complex) – Data values.

  • rescale (boolean, optional) –

    Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

    New in version 0.14.0.

  • tree_options (dict, optional) –

    Options passed to the underlying cKDTree.

    New in version 0.17.0.

Notes

Uses scipy.spatial.cKDTree

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnRbf(function='multiquadric', smooth=0, norm='euclidean')

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for scipy.interpolate.Rbf class.

Rbf(*args)

A class for radial basis function approximation/interpolation of n-dimensional scattered data.

Parameters
  • *args (arrays) – x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes

  • function (str or callable, optional) –

    The radial basis function, based on the radius, r, given by the norm (default is Euclidean distance); the default is ‘multiquadric’:

    'multiquadric': sqrt((r/self.epsilon)**2 + 1)
    'inverse': 1.0/sqrt((r/self.epsilon)**2 + 1)
    'gaussian': exp(-(r/self.epsilon)**2)
    'linear': r
    'cubic': r**3
    'quintic': r**5
    'thin_plate': r**2 * log(r)
    

    If callable, then it must take 2 arguments (self, r). The epsilon parameter will be available as self.epsilon. Other keyword arguments passed in will be available as well.

  • epsilon (float, optional) – Adjustable constant for gaussian or multiquadrics functions - defaults to approximate average distance between nodes (which is a good start).

  • smooth (float, optional) – Values greater than zero increase the smoothness of the approximation. 0 is for interpolation (default), the function will always go through the nodal points in this case.

  • norm (str, callable, optional) – A function that returns the ‘distance’ between two points, with inputs as arrays of positions (x, y, z, …), and an output as an array of distance. E.g., the default: ‘euclidean’, such that the result is a matrix of the distances from each point in x1 to each point in x2. For more options, see the documentation of scipy.spatial.distance.cdist.

N

The number of data points (as determined by the input arrays).

Type

int

di

The 1-D array of data values at each of the data coordinates xi.

Type

ndarray

xi

The 2-D array of data coordinates.

Type

ndarray

function

The radial basis function. See description under Parameters.

Type

str or callable

epsilon

Parameter used by gaussian or multiquadrics functions. See Parameters.

Type

float

smooth

Smoothing parameter. See description under Parameters.

Type

float

norm

The distance function. See description under Parameters.

Type

str or callable

nodes

A 1-D array of node values for the interpolation.

Type

ndarray

A
Type

internal property, do not use

Examples

>>> import numpy as np
>>> from scipy.interpolate import Rbf
>>> x, y, z, d = np.random.rand(4, 50)
>>> rbfi = Rbf(x, y, z, d)  # radial basis function interpolator instance
>>> xi = yi = zi = np.linspace(0, 1, 20)
>>> di = rbfi(xi, yi, zi)   # interpolated values
>>> di.shape
(20,)
fit(X, y)
predict(X)

uncoverml.krige module

class uncoverml.krige.Krige(method='ordinary', variogram_model='linear', nlags=6, weight=False, n_closest_points=10, verbose=False)

Bases: uncoverml.models.TagsMixin, sklearn.base.RegressorMixin, sklearn.base.BaseEstimator, uncoverml.krige.KrigePredictDistMixin

A scikit-learn wrapper class for Ordinary and Universal Kriging. This works with both GridSearchCV/RandomizedSearchCV for optimising the Krige parameters.

fit(x, y, *args, **kwargs)
Parameters
  • x (ndarray) – array of Points, (x, y) pairs

  • y (ndarray) – array of targets

predict(x, *args, **kwargs)
Parameters

x (ndarray) – array of points, (x, y) pairs

Returns

Prediction array
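A usage sketch (data illustrative):

import numpy as np
from uncoverml.krige import Krige

lon_lat = np.random.rand(50, 2)           # (x, y) positions
y = np.random.rand(50)
model = Krige(method='ordinary', variogram_model='linear')
model.fit(lon_lat, y)
pred = model.predict(lon_lat)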

class uncoverml.krige.KrigePredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the Krige class.

This is especially for use with PyKrige Ordinary/UniversalKriging classes.

predict_dist(x, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • x (ndarray) – (Ns, 2) array query dataset (Ns samples, 2 dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

Returns

  • prediction (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • variance (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.krige.MLKrige(ml_method, ml_params={}, *args, **kwargs)

Bases: object

class uncoverml.krige.MLKrigeBase(ml_method, ml_params={}, method='ordinary', variogram_model='linear', n_closest_points=10, nlags=6, weight=False, verbose=False)

Bases: uncoverml.models.TagsMixin

This is an implementation of Regression-Kriging as described here: https://en.wikipedia.org/wiki/Regression-Kriging

fit(x, y, lon_lat, *args, **kwargs)

Fit the ML method and also Krige the residual.

Parameters
  • x (ndarray) – (Nt, d) array query dataset (Ns samples, d dimensions) for ML regression

  • y (ndarray) – array of targets (Nt, )

  • lon_lat – ndarray of (x, y) points. Needs to be a (Nt, 2) array corresponding to the lon/lat, for example.

krige_residual(lon_lat)
Parameters

lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

Returns

residual: ndarray

kriged residual values

ml_prediction(x, *args, **kwargs)
Parameters

x (ndarray) – regression matrix

Returns

ndarray – machine learning prediction

predict(x, lon_lat, *args, **kwargs)

Must override predict_dist method of Krige. Predictive mean and variance for a probabilistic regressor.

Parameters
  • x (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression

  • lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

Returns

pred – The expected value of ys for the query inputs, X of shape (Ns,).

Return type

ndarray

score(x, y, lon_lat, sample_weight=None)

Overloading default regression score method
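In outline, regression kriging fits the ML model on the covariates, kriges the residuals against position, and sums the two at prediction time. A hedged sketch (the ml_method value 'randomforest' is a hypothetical example):

import numpy as np
from uncoverml.krige import MLKrigeBase

x = np.random.rand(100, 5)                # covariates
y = np.random.rand(100)                   # targets
lon_lat = np.random.rand(100, 2)          # positions
model = MLKrigeBase(ml_method='randomforest', method='ordinary')
model.fit(x, y, lon_lat)                  # fit ML, then krige the residual
pred = model.predict(x, lon_lat)          # ML prediction + kriged residual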

class uncoverml.krige.MLKrigePredictDistMixin

Bases: object

predict_dist(x, interval=0.95, lon_lat=None, *args, **kwargs)

Predictive mean, variance, lower and upper quantile for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression

  • lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • kwargs – must contain a key lon_lat, which needs to be a (Ns, 2) array corresponding to the lon/lat.

Returns

  • pred (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • var (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.krige.MLKrigePreidctDist(*args, **kwargs)

Bases: uncoverml.krige.MLKrigeBase, uncoverml.krige.MLKrigePredictDistMixin

uncoverml.learn module

Handles calling learning methods on models.

uncoverml.learn.local_learn_model(x_all, targets_all, config)

Trains a model. Handles special case of parallel models.

Parameters
  • x_all (np.ndarray) – All covariate data, shape (n_samples, n_features), sorted using X, Y of target positions.

  • targets_all (np.ndarray) – All target data, shape (n_samples), sorted using X, Y of target positions.

  • config (Config) – Config object.

Returns

A trained Model.

Return type

Model

uncoverml.likelihoods module

Likelihood functions that can be used with revrand.

Can be used with revrand’s GeneralisedLinearModel class for specialised regression tasks such as basement depth estimation from censored and uncensored depth observations.

class uncoverml.likelihoods.Switching(lenscale=1.0, var_init=Parameter(value=1.0, bounds=Positive(upper=None), shape=()))

Bases: revrand.likelihoods.Bernoulli

Ey(f, var, z)

Expected value of the Bernoulli likelihood.

Parameters

f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).

Return type

ndarray

cdf(y, f, var, z)

Cumulative density function of the likelihood.

Parameters
  • y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

cdf – Cumulative density function evaluated at y.

Return type

ndarray

df(y, f, var, z)

Derivative of Bernoulli log likelihood w.r.t. f.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

df – the derivative \(\partial \log p(y|f) / \partial f\)

Return type

ndarray

dp(y, f, var, z)

Derivative of Bernoulli log likelihood w.r.t. the parameters, \(\theta\).

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

dp – the derivative \(\partial \log p(y|f, \theta)/ \partial \theta\) for each parameter. If there is only one parameter, this is not a list.

Return type

list, float or ndarray

loglike(y, f, var, z)

Bernoulli log likelihood.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

logp – the log likelihood of each y given each f under this likelihood.

Return type

ndarray

class uncoverml.likelihoods.UnifGauss(lenscale=1.0)

Bases: revrand.likelihoods.Bernoulli

Ey(f)

Expected value of the Bernoulli likelihood.

Parameters

f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).

Return type

ndarray

cdf(y, f)

Cumulative density function of the likelihood.

Parameters
  • y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

cdf – Cumulative density function evaluated at y.

Return type

ndarray

df(y, f)

Derivative of Bernoulli log likelihood w.r.t. f.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

df – the derivative \(\partial \log p(y|f) / \partial f\)

Return type

ndarray

loglike(y, f)

Bernoulli log likelihood.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

logp – the log likelihood of each y given each f under this likelihood.

Return type

ndarray

pdf(y, f)

uncoverml.metadata_profiler module

Description:

Gather Metadata for the uncover-ml prediction output results:

Reference: email 2019-05-24 Overview Creator: (person who generated the model) Model;

Name: Type and date: Algorithm: Extent: Lat/long - location on Australia map?

SB Notes: None of the above is required as this information will be captured in the yaml file.

Model inputs:

  1. Covariates - list (in full)

  2. Targets: path to shapefile: csv file

SB Notes: Only covariate list file. Targets and path to shapefile are not required as this is available in the yaml file. Maybe the full path to the shapefile has some merit as one can specify a partial path.

Model performance

JSON file (in full)

SB Notes: Yes

Model outputs

  1. Prediction grid including path

  2. Quantiles Q5; Q95

  3. Variance:

  4. Entropy:

  5. Feature rank file

  6. Raw covariates file (target value - covariate value)

  7. Optimisation output

  8. Others ??

SB Notes: Not required as these are model dependent, and the metadata will be contained in each of the output geotiff files.

Model parameters:

  1. YAML file (in full)

  2. .SH file (in full)

SB Notes: The .sh file is not required. The YAML file is read as a python dictionary in uncoverml, which can be dumped in the metadata.

CreationDate: 31/05/19 Developer: fei.zhang@ga.gov.au

Revision History:

LastUpdate: 31/05/19 FZ LastUpdate: dd/mm/yyyy Who Optional description

class uncoverml.metadata_profiler.MetadataSummary(model, config)

Bases: object

Summary Description of the ML prediction output

write_metadata(out_filename)

Write the metadata for this prediction result into a human-readable txt file, in order to make the ML results traceable and reproducible (provenance).

uncoverml.mllog module

Logging config.

class uncoverml.mllog.ElapsedFormatter

Bases: object

format(record)
class uncoverml.mllog.MPIStreamHandler(stream=None)

Bases: logging.StreamHandler

If a message starts with ‘:mpi:’, the message will be logged regardless of node (the ‘:mpi:’ will be removed from the message). Otherwise, only node 0 will emit messages.
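A sketch of the prefix convention, assuming logging has been configured via uncoverml.mllog.configure so this handler is installed:

import logging

log = logging.getLogger('uncoverml')
log.info('only node 0 emits this message')
log.info(':mpi:every node emits this message (prefix stripped)')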

emit(record)

Emit a record.

If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.

uncoverml.mllog.configure(verbosity)
uncoverml.mllog.handle_exception(exc_type, exc_value, exc_traceback)

Add MPI index to exception traceback.

uncoverml.mllog.warn_with_traceback(message, category, filename, lineno, line=None)

copied from: http://stackoverflow.com/questions/22373927/get-traceback-of-warnings

uncoverml.models module

Model Objects and ML algorithm serialisation.

This module makes many of the models in scikit learn and revrand available to our pipeline, as well as augmenting their functionality with, for examples, target transformations.

This table is a quick breakdown of the advantages and disadvantages of the various algorithms we can use in this pipeline.

Algorithm                   | Learning Scalability | Modelling Capacity | Prediction Speed | Probabilistic
----------------------------+----------------------+--------------------+------------------+--------------
Bayesian linear regression  | + + +                | +                  | + + + +          | Yes
Approx. Gaussian process    | + +                  | + + + +            | + + + +          | Yes
SGD linear regression       | + + + +              | +                  | + + +            | Yes
SGD Gaussian process        | + + + +              | + + + +            | + + +            | Yes
Support Vector Regression   | +                    | + + + +            | +                | No
Random Forest Regression    | + + +                | + + + +            | + +              | Pseudo
Cubist Regression           | + + +                | + + + +            | + +              | Pseudo
ARD Regression              | + +                  | + +                | + + +            | No
Extremely Randomized Reg.   | + + +                | + + + +            | + +              | No
Decision Tree Regression    | + + +                | + + +              | + + + +          | No

class uncoverml.models.ARDRegressionTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

ARD regression.

http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression

class uncoverml.models.ApproxGP(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, tol=1e-08, maxiter=1000, nstarts=100)

Bases: uncoverml.models.BasisMakerMixin, revrand.slm.StandardLinearModel, uncoverml.models.PredictDistMixin, uncoverml.models.MutualInfoMixin

An approximate Gaussian process for medium scale data.

Parameters
  • kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at basismap dictionary for appropriate kernel approximations.

  • nbases (int) – how many unique random bases to create (twice this number will be actually created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this will have to also be high dimensional.

  • lenscale (float, optional) – the initial value for the kernel length scale to be learned.

  • ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • tol (float, optional) – optimiser function tolerance convergence criterion.

  • maxiter (int, optional) – maximum number of iterations for the optimiser.

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
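A usage sketch (data illustrative; the return layout of predict_dist is assumed to follow PredictDistMixin, i.e. mean, variance, and lower/upper quantiles):

import numpy as np
from uncoverml.models import ApproxGP

X = np.random.rand(200, 3)
y = np.random.rand(200)
model = ApproxGP(kernel='rbf', nbases=50)
model.fit(X, y)
Ey, Vy, ql, qu = model.predict_dist(X, interval=0.95)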

class uncoverml.models.ApproxGPTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Approximate Gaussian process.

http://nicta.github.io/revrand/slm.html

class uncoverml.models.BasisMakerMixin

Bases: object

Mixin class for easily creating approximate kernel functions for revrand.

This is primarily used for the approximate Gaussian process algorithms.

fit(X, y, *args, **kwargs)
class uncoverml.models.BootstrappedSVR(n_models=100, parallel=True, *args, **kwargs)

Bases: uncoverml.models.bootstrap_model.<locals>.BootstrappedModel, uncoverml.models.TagsMixin

class uncoverml.models.CubistMultiTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Parallel Cubist regression (wrapper).

https://www.rulequest.com/cubist-info.html

class uncoverml.models.CubistTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Cubist regression (wrapper).

https://www.rulequest.com/cubist-info.html

class uncoverml.models.CustomKNeighborsRegressor(n_neighbors=10, weights='distance', algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, min_distance=0.0)

Bases: sklearn.neighbors._regression.KNeighborsRegressor

class uncoverml.models.DecisionTreeTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Decision tree regression.

http://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

class uncoverml.models.ExtraTreeTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Extremely randomised tree regressor.

http://scikit-learn.org/dev/modules/generated/sklearn.tree.ExtraTreeRegressor.html#sklearn.tree.ExtraTreeRegressor

class uncoverml.models.GLMPredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the GeneralisedLinearModel class in revrand.

This is especially for use with Gaussian likelihood models.

predict_dist(X, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • fields (dict, optional) – dictionary of fields parsed from the shapefile. indicator_field should be a key in this dictionary. If it is not present, a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.

Returns

  • Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.models.GradBoostedTrees(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Gradient Boosted Trees multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

class uncoverml.models.KNearestNeighborTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

K Nearest Neighbour Regression

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

class uncoverml.models.LinearReg(onescol=True, var=1.0, regulariser=1.0, tol=1e-08, maxiter=1000, nstarts=100)

Bases: revrand.slm.StandardLinearModel, uncoverml.models.PredictDistMixin, uncoverml.models.MutualInfoMixin

Bayesian standard linear model.

Parameters
  • onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • tol (float, optional) – optimiser function tolerance convergence criterion.

  • maxiter (int, optional) – maximum number of iterations for the optimiser.

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

class uncoverml.models.LinearRegTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Bayesian linear regression.

http://nicta.github.io/revrand/slm.html

class uncoverml.models.LogisticClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Logistic Regression for multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

class uncoverml.models.LogisticRBF(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Approximate large scale kernel logistic regression.

class uncoverml.models.MaskRows(*Xs)

Bases: object

apply_mask(X)
apply_masks(*Xs)
static get_complete_rows(X)
trim_mask(X)
trim_masks(*Xs)
class uncoverml.models.MultiRandomForestTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

MPI implementation of random forest regression, with forests grown across many CPUs.

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

class uncoverml.models.MutualInfoMixin

Bases: object

Mixin class for providing predictive entropy reduction functionality to the StandardLinearModel class (only).

entropy_reduction(X)

Predictive entropy reduction (a.k.a. mutual information).

Estimate the reduction in the posterior distribution’s entropy (i.e. model uncertainty reduction) as a result of including a particular observation.

Parameters

X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

Returns

MI – Prediction of mutual information (expected reduction in posterior entropy) associated with each query input. The units are ‘nats’, and the shape of the returned array is (Ns,).

Return type

ndarray
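
For example, a minimal usage sketch (assuming m is a fitted LinearReg instance and X_query is an (Ns, d) query array; both names are illustrative):

>>> mi = m.entropy_reduction(X_query)  # shape (Ns,), units of nats
>>> most_informative = mi.argmax()     # query whose observation would most reduce posterior entropy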

class uncoverml.models.PredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the StandardLinearModel class in revrand.

predict_dist(X, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • fields (dict, optional) – dictionary of fields parsed from the shapefile. indicator_field should be a key in this dictionary. If it is not present, a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.

Returns

  • Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)
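
A minimal usage sketch (assuming m is a fitted model using this mixin and X_query is an (Ns, d) array; both names are illustrative):

>>> Ey, Vy, ql, qu = m.predict_dist(X_query, interval=0.9)
>>> # Ey: predictive means, Vy: predictive variances,
>>> # ql, qu: bounds of the central 90% interval, each of shape (Ns,)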

class uncoverml.models.RandomForestClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Random Forest for multi-class classification.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

class uncoverml.models.RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

Bases: sklearn.ensemble._forest.RandomForestRegressor

Implements a “probabilistic” output by looking at the variance of the decision tree estimator outputs.

predict_dist(X, interval=0.95)
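
The idea can be sketched as follows (an illustrative approximation, not the exact implementation): the predictive mean is the usual forest prediction, the variance is taken across the individual tree predictions, and the interval follows from a Gaussian assumption (with nonzero variance).

import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def predict_dist_sketch(forest: RandomForestRegressor, X, interval=0.95):
    # Per-tree predictions, shape (n_trees, n_samples)
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    Ey = per_tree.mean(axis=0)  # predictive mean
    Vy = per_tree.var(axis=0)   # "probabilistic" output: variance across trees
    # Central `interval` quantiles under a Gaussian assumption
    ql, qu = norm.interval(interval, loc=Ey, scale=np.sqrt(Vy))
    return Ey, Vy, ql, qu
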
class uncoverml.models.RandomForestRegressorMulti(outdir='.', forests=10, parallel=True, n_estimators=10, random_state=1, **kwargs)

Bases: object

fit(x, y, *args, **kwargs)
predict(x)
predict_dist(x, interval=0.95, *args, **kwargs)
class uncoverml.models.RandomForestTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Random forest regression.

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

class uncoverml.models.SGDApproxGP(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=1, nstarts=500)

Bases: uncoverml.models.BasisMakerMixin, revrand.glm.GeneralisedLinearModel, uncoverml.models.GLMPredictDistMixin

An approximate Gaussian process for large scale data using stochastic gradients.

This uses the Adam stochastic gradients algorithm; see http://arxiv.org/pdf/1412.6980

Parameters
  • kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at the basismap dictionary for appropriate kernel approximations.

  • nbases (int) – how many unique random bases to create (twice this number will actually be created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this will also have to be high dimensional.

  • lenscale (float, optional) – the initial value for the kernel length scale to be learned.

  • ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.

  • var (float, optional) – observation variance initial value.

  • regulariser (float, optional) – weight regulariser (variance) initial value.

  • maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.

  • batch_size (int, optional) – number of observations to use per SGD batch.

  • alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.

  • beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].

  • beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].

  • epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).

  • random_state (int or RandomState, optional) – random seed

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

Note

Setting the random_state may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!

class uncoverml.models.SGDApproxGPTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Approximate Gaussian processes with stochastic gradients.

http://nicta.github.io/revrand/glm.html

class uncoverml.models.SGDLinearReg(onescol=True, var=1.0, regulariser=1.0, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=None, nstarts=500)

Bases: revrand.glm.GeneralisedLinearModel, uncoverml.models.GLMPredictDistMixin

Bayesian standard linear model, using stochastic gradients.

This uses the Adam stochastic gradients algorithm; see http://arxiv.org/pdf/1412.6980

Parameters
  • onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.

  • batch_size (int, optional) – number of observations to use per SGD batch.

  • alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.

  • beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].

  • beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].

  • epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).

  • random_state (int or RandomState, optional) – random seed

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

Note

Setting the random_state may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!

class uncoverml.models.SGDLinearRegTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Bayesian linear regression with stochastic gradients.

http://nicta.github.io/revrand/glm.html

class uncoverml.models.SVRTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Support vector machine.

http://scikit-learn.org/dev/modules/svm.html#svm

class uncoverml.models.SupportVectorClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Support Vector Machine multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

class uncoverml.models.TagsMixin

Bases: object

Mixin class to aid a pipeline in establishing the types of predictive outputs to be expected from the ML algorithms in this module.

get_predict_tags()

Get the types of prediction outputs from this algorithm.

Returns

of strings with the types of outputs that can be returned by this algorithm. This depends on the prediction methods implemented (e.g. predict, predict_dist, entropy_reduction).

Return type

list

class uncoverml.models.TransformedCTInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedLinearNDInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedNearestNDInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedRbfInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

uncoverml.models.apply_masked(func, data, *args, **kwargs)
uncoverml.models.apply_multiple_masked(func, data, *args, **kwargs)
uncoverml.models.bootstrap_model(model)
uncoverml.models.encode_targets(Classifier)
uncoverml.models.kernelize(classifier)
uncoverml.models.transform_targets(Regressor)

Factory function that adds target transformation capability to compatible scikit-learn objects.

Look at the transformers.py module for more information on valid target transformers.

Example

>>> svr = transform_targets(SVR)(target_transform='Standardise', gamma=0.1)

uncoverml.mpiops module

uncoverml.mpiops.chunk_index = 0

the index (from zero) of this node in the MPI world. Also known as the rank of the node.

Type

int

uncoverml.mpiops.chunks = 1

the total number of nodes in the MPI world

Type

int

uncoverml.mpiops.comm = <mpi4py.MPI.Intracomm object>

module-level MPI ‘world’ object representing all connected nodes

uncoverml.mpiops.count(x)
uncoverml.mpiops.count_targets(targets)
uncoverml.mpiops.covariance(x)
uncoverml.mpiops.create_shared_array(data, root=0, writeable=False)

Create a shared numpy array among MPI nodes. To access the data, refer to the returned numpy array ‘shared’. The second return value is the MPI window; this doesn’t need to be interacted with except when deallocating the memory.

When finished with the data, set shared = None and call win.Free().

Caution: any node with a handle on the shared array can modify its contents. To be safe, the shared array is set to read-only by default.

Parameters
  • data (numpy.ndarray) – The numpy array to share.

  • root (int) – Rank of the root node that contains the original data.

  • writeable (bool) – Whether or not the resulting shared array is writeable.

Returns

Return type

tuple of numpy.ndarray, MPI window
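
A minimal usage sketch (the array contents are illustrative; only the root node’s data is shared):

import numpy as np
from uncoverml import mpiops

data = np.arange(12, dtype=float)           # original data on the root node
shared, win = mpiops.create_shared_array(data, root=0)
total = shared.sum()                        # any node may read the shared array
# When finished: drop all references to the array, then free the window
shared = None
win.Free()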

uncoverml.mpiops.eigen_decomposition(x)
uncoverml.mpiops.max_axis_0(x, y, dtype)
uncoverml.mpiops.mean(x)
uncoverml.mpiops.min_axis_0(x, y, dtype)
uncoverml.mpiops.minimum(x)
uncoverml.mpiops.outer(x)
uncoverml.mpiops.outer_count(x)
uncoverml.mpiops.power(x, exp)
uncoverml.mpiops.random_full_points(x, Napprox)
uncoverml.mpiops.run_once(f, *args, **kwargs)

Run a function on one node and broadcast the result to all.

This function evaluates a function on a single node in the MPI world, then broadcasts the result of that function to every node in the world.

Parameters
  • f (callable) – The function to be evaluated. Can take arbitrary arguments and return anything or nothing.

  • args (optional) – Other positional arguments to pass on to f

  • kwargs (optional) – Other named arguments to pass on to f

Returns

The value returned by f

Return type

result
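
For example (a sketch; load_settings and the file path are illustrative):

import json
from uncoverml import mpiops

def load_settings(path):
    # Something worth doing only once, e.g. reading a file from disk
    with open(path) as f:
        return json.load(f)

# Evaluated on a single node; every node receives the parsed dict
settings = mpiops.run_once(load_settings, 'settings.json')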

uncoverml.mpiops.sd(x)
uncoverml.mpiops.sum_axis_0(x, y, dtype)
uncoverml.mpiops.unique(sets1, sets2, dtype)

uncoverml.patch module

Image patch extraction and windowing utilities.

uncoverml.patch.all_patches(image, patchsize)
uncoverml.patch.grid_patches(image, pwidth)

Generate (overlapping) patches from an image. This function extracts square patches from an image in an overlapping, dense grid.

Parameters
  • image (ndarray) – an array of shape (x, y) or (x, y, channels).

  • pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.

Returns

patch – An image of shape (x, y, channels*psize*psize), where psize = pwidth * 2 + 1

Return type

ndarray
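
For example, with pwidth = 1 every output pixel carries its full 3x3 neighbourhood (a sketch; the image shape is illustrative):

import numpy as np
from uncoverml import patch

image = np.random.rand(100, 80, 4)             # (x, y, channels)
patched = patch.grid_patches(image, pwidth=1)
# psize = 2 * 1 + 1 = 3, so per the return description above the
# expected shape is (100, 80, 4 * 3 * 3)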

uncoverml.patch.patches_at_target(image, patchsize, targets)
uncoverml.patch.point_patches(image, pwidth, points)

Extract patches from an image at specified points.

Parameters
  • image (ndarray) – an array of shape (x, y, channels).

  • pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.

  • points (ndarray) – of shape (N, 2) where there are N points, each with an x and y coordinate of the patch centre within the image.

Returns

patches – An image patch array of shape (N, psize, psize, channels), where psize = pwidth * 2 + 1

Return type

ndarray

uncoverml.predict module

uncoverml.predict.cluster_analysis(x, y, partition_no, config, feature_names)
Parameters
  • x (ndarray) – array of dim (Ns, d)

  • y (ndarray) – array of predictions of dimension (Ns, 1)

  • partition_no (int) – partition number of the image

  • config (config object) –

  • feature_names (list) – list of strings corresponding to ordered feature names

uncoverml.predict.div0(a, b)

Division that ignores divide-by-zero: div0([-1, 0, 1], 0) -> [0, 0, 0]
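
A minimal sketch of these semantics with numpy (illustrative, not necessarily the exact implementation):

import numpy as np

def div0_sketch(a, b):
    # Elementwise a / b on array-like inputs, mapping x / 0 to 0
    # instead of inf or nan
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[~np.isfinite(c)] = 0.0   # -inf, inf, nan -> 0
    return c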

uncoverml.predict.final_cluster_analysis(n_classes, n_paritions)
uncoverml.predict.predict(data, model, interval=0.95, **kwargs)
uncoverml.predict.render_partition(model, subchunk, image_out, config)
uncoverml.predict.shapefile_prediction(config, model)
uncoverml.predict.write_mean_and_sd(x, y, writer, config)

uncoverml.resampling module

Module for shapefile resampling methods. This code was originally developed by Sudipta Basak (https://github.com/basaks).

See uncoverml.scripts.shiftmap_cli for a resampling CLI.

uncoverml.resampling.bootstrap_data_indicies(population, samples=None, random_state=1)
uncoverml.resampling.filter_fields(fields_to_keep, gdf)
uncoverml.resampling.prepapre_dataframe(data, fields_to_keep)
uncoverml.resampling.resample_by_magnitude(input_data, target_field, bins=10, interval='percentile', fields_to_keep=[], bootstrap=True, output_samples=None, validation=False, validation_points=100)
Parameters
  • input_data (geopandas.GeoDataFrame) – Geopandas dataframe containing targets to be resampled.

  • target_field (str) – target field name based on which resampling is performed. Field must exist in the input data

  • bins (int) – number of bins for sampling

  • fields_to_keep (list) – of strings to store in the output shapefile

  • bootstrap (bool, optional) – whether to sample with replacement or not

  • output_samples (int, optional) – number of samples in the output shapefile. If not provided, the number of output samples will be the same as in the original shapefile

  • validation (bool, optional) – whether to also produce a validation shapefile

  • validation_points (int, optional) – approximate number of points in the validation shapefile

uncoverml.resampling.resample_spatially(input_data, target_field, rows=10, cols=10, fields_to_keep=[], bootstrap=True, output_samples=None, validation_points=100)
Parameters
  • input_data (geopandas.GeoDataFrame) – Geopandas dataframe containing targets to be resampled.

  • target_field (str) – target field name based on which resampling is performed. Field must exist in the input data

  • rows (int, optional) – number of bins in y

  • cols (int, optional) – number of bins in x

  • fields_to_keep (list) – of strings to store in the output shapefile

  • bootstrap (bool, optional) – whether to sample with replacement or not

  • output_samples (int, optional) – number of samples in the output shapefile. If not provided, the number of output samples will be the same as in the original shapefile

  • validation_points (int, optional) – approximate number of points in the validation shapefile

Returns

Return type

output_shapefile name

uncoverml.targets module

class uncoverml.targets.Targets(lonlat, vals, othervals=None)

Bases: object

classmethod from_geodataframe(gdf, observations_field='observations')

Returns a Targets object from a geopandas dataframe. One column will be taken as the main ‘observations’ field. All remaining non-geometry columns will be stored in the fields property.

Parameters

observations_field (str) – Name of the column in the dataframe that is the main target observation (the field to train on).

Returns

Return type

Targets

to_geodataframe()

Returns a copy of the targets as a geopandas dataframe.

Returns

Return type

geopandas.GeoDataFrame

uncoverml.targets.gather_targets(targets, keep, node=None)
uncoverml.targets.gather_targets_main(targets, keep, node)
uncoverml.targets.generate_covariate_shift_targets(targets, bounds, n_points)
uncoverml.targets.generate_dummy_targets(bounds, label, n_points, field_keys=[], seed=1)

Generate dummy points with randomly generated positions. Points are generated on node 0 and distributed to other nodes if running in parallel.

Parameters
  • bounds (tuple of float) – Bounding box to generate targets within, of format (xmin, ymin, xmax, ymax).

  • label (str) – Label to assign generated targets.

  • n_points (int) – Number of points to generate

  • field_keys (list of str, optional) – List of keys to add to fields property.

  • seed (int, optional) – Random number generator seed.

Returns

A collection of randomly generated targets.

Return type

Targets
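
For example (a sketch; the bounding box values are illustrative):

from uncoverml import targets

dummies = targets.generate_dummy_targets(
    bounds=(120.0, -35.0, 125.0, -30.0),   # (xmin, ymin, xmax, ymax)
    label='dummy',
    n_points=1000,
)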

uncoverml.targets.label_targets(targets, label, backup_field=None)

Replaces target observations (the target property being trained on) with the given label.

Parameters
  • targets (Targets) – A collection of targets to label.

  • label (str) – The label to apply.

  • backup_field (str) – If present, copies the original observation data to the fields property with the provided string as the key.

Returns

The labelled targets.

Return type

Targets

uncoverml.targets.merge_targets(a, b)

Merges two Targets objects. They will be sorted in the canonical uncover-ml way: lexically by position (y, x).

Parameters
  • a (Targets) – The first Targets object to merge.

  • b (Targets) – The second Targets object to merge.

Returns

A single merged collection of targets.

Return type

Targets
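
The canonical ordering can be sketched with numpy’s lexsort (illustrative only):

import numpy as np

def canonical_order(lonlat):
    # Sort lexically by position (y, x): np.lexsort treats the *last*
    # key as primary, so pass x first and y second
    return np.lexsort((lonlat[:, 0], lonlat[:, 1]))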

uncoverml.targets.save_dropped_targets(config, keep, targets)
uncoverml.targets.save_targets(targets, path, obs_filter=None)

Saves target positions and observation data to a CSV file.

Parameters
  • targets (Targets) – The targets to save.

  • path (str) – Path to file to save as.

  • obs_filter (any, optional) – If provided, will only save points that have this observation data.

uncoverml.validate module

Scripts for validation

class uncoverml.validate.CrossvalInfo(scores, y_true, y_pred, classification, positions)

Bases: object

export_crossval(config)

Exports a CSV file containing real target values and their corresponding predicted value generated as part of cross-validation.

Also populates the ‘prediction’ column of the ‘rawcovariates’ CSV file.

If enabled, the real vs predicted values will be plotted.

Parameters

config (Config) – Uncover-ml config object.

class uncoverml.validate.OOSInfo(scores, y_true, y_pred, classification, positions)

Bases: uncoverml.validate.CrossvalInfo

export_scores(config)
uncoverml.validate.adjusted_r2_score(r2, n_samples, n_covariates)
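
This presumably applies the conventional adjustment, penalising R² for the number of covariates (a sketch under that assumption):

def adjusted_r2_sketch(r2, n_samples, n_covariates):
    # Standard adjusted R^2: corrects the score for model complexity
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_covariates - 1)
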
uncoverml.validate.classification_validation_scores(ys, eys, pys)

Calculates the validation scores for a classification prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:

  • accuracy

  • log_loss

  • f1

Parameters
  • ys (numpy.array) – The test data outputs, one-hot representation

  • eys (numpy.array) – The (hard) predictions made by the trained model on test data, one-hot representation

  • pys (numpy.array) – The probabilistic predictions made by the trained model on test data

Returns

scores – A dictionary containing all of the evaluated scores.

Return type

dict

uncoverml.validate.local_crossval(x_all, targets_all, config)

Performs K-fold cross validation to test the applicability of a model. Given a set of inputs and outputs, this function evaluates the effectiveness of a model at predicting the targets by splitting all of the known data. A model is trained on a subset of the total data and then used to predict all of the unseen targets; its performance provides a benchmark for evaluating the effectiveness of the model.

Parameters
  • x_all (numpy.array) – A 2D array containing all of the training inputs

  • targets_all (numpy.array) – A 1D vector containing all of the training outputs

  • config (Config) – The global config object, which is used to choose the model to train.

Returns

result – A dictionary containing all of the cross validation metrics, evaluated on the unseen data subset.

Return type

dict

uncoverml.validate.local_rank_features(image_chunk_sets, transform_sets, targets, config)

Ranks the importance of the features based on their performance. This function trains and cross-validates a model with each individual feature removed, and then measures the resulting performance. The most important feature is the one which, when removed, causes the greatest degradation in the performance of the model.

Parameters
  • image_chunk_sets (dict) – A dictionary used to get the set of images to test on.

  • transform_sets (list) – A list containing the applied transformations

  • targets (instance of geoio.Targets class) – The targets used in the cross validation

  • config (config class instance) – The global config file

uncoverml.validate.out_of_sample_validation(model, targets, features, config)
uncoverml.validate.permutation_importance(model, x_all, targets_all, config)
uncoverml.validate.regression_validation_scores(y, ey, n_covariates, model)

Calculates the validation scores for a regression prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:

  • r2_score

  • expvar

  • smse

  • lins_ccc

  • mll

Parameters
  • y (numpy.array) – The test data outputs

  • ey (numpy.array) – The predictions made by the trained model on test data

  • n_covariates (int) – The number of covariates being used.

Returns

scores – A dictionary containing all of the evaluated scores.

Return type

dict

uncoverml.validate.split_cfold(nsamples, k=5, seed=None)

Function that returns indices for splitting data into random folds.

Parameters
  • nsamples (int) – the number of samples in the dataset

  • k (int, optional) – the number of folds

  • seed (int, optional) – random seed to provide to numpy

Returns

  • cvinds (list) – list of k arrays of indices, each of approximate length nsamples / k. Together these arrays form a random permutation (without replacement) of the sample indices, assigning each sample to exactly one fold.

  • cvassigns (ndarray) – array of shape (nsamples,) with each element in [0, k), that can be used to assign data to a fold. This corresponds to the indices of cvinds.
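
For example (a sketch):

from uncoverml.validate import split_cfold

cvinds, cvassigns = split_cfold(nsamples=100, k=5, seed=1)
fold0_test = cvinds[0]          # indices held out for fold 0
train_mask = cvassigns != 0     # boolean mask selecting fold-0 training data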

Module contents