uncoverml package

Submodules

uncoverml.cluster module

class uncoverml.cluster.KMeans(k, oversample_factor)

Bases: object

Model object implementing learn and predict with K-means

Parameters
  • k (int > 0) – The number of classes to cluster the data into

  • oversample_factor (int > 1) – Controls the number of samples drawn as part of [1] in the initialisation step. More MPI nodes will increase the total number of points. Consider a value of 1 for more than about 16 nodes

References

1

Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. “Scalable k-means++.” Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633.
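A minimal usage sketch (data and argument values are illustrative; uncoverml normally drives this under MPI, but a single process with mpi4py installed should also work, and the exact columns returned by predict follow get_predict_tags):

import numpy as np
from uncoverml.cluster import KMeans

X = np.random.rand(1000, 4)               # (n_samples, n_dimensions) features
model = KMeans(k=5, oversample_factor=2)
model.learn(X)                            # unsupervised: no indices/classes
classes = model.predict(X)                # cluster assignment per sample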

get_predict_tags()
learn(x, indices=None, classes=None)

Find the cluster centres using k-means||

Parameters
  • x (ndarray) – (n_samples, n_dimensions) length array containing the training samples to cluster

  • indices (ndarray) – (n_samples) length integer array giving the locations in x where labels exist

  • classes (ndarray) – (n_samples) length integer array giving the class assignments of points in x in locations given by indices

predict(x, *args, **kwargs)
class uncoverml.cluster.TrainingData(indices, classes)

Bases: object

Light wrapper for the indices and values of training data

Parameters
  • indices (ndarray) – length N array of the indices of the input data that have classes assigned

  • classes (ndarray) – length N int array of the class values at locations specified by indices

uncoverml.cluster.centroid(X, weights=None)

Compute the centroid of a set of points X

The points X may have repetitions given by the weights.

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points

  • weights (ndarray (optional)) – (n,) array of weights giving the repetition (or mass) of each point in X

Returns

centroid – (d,) length array, the d-dimensional centroid point of all x in X.

Return type

ndarray
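Locally (ignoring the MPI reduction the real function performs), the weighted centroid is just a weighted mean. A sketch:

import numpy as np

def centroid_sketch(X, weights=None):
    # weighted mean over the n points; weights=None gives the plain mean
    return np.average(X, axis=0, weights=weights)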

uncoverml.cluster.compute_class(X, C, training_data=None)

Find the closest cluster centre for each x in X

This returns which cluster centre each x in X belongs to, with optional semi-supervised training data that will force an assignment of a point to a particular class

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points to be evaluated

  • C (ndarray) – (k, d) array of cluster centres, associated with classes 0..k-1

  • training_data (TrainingData (optional)) – instance of TrainingData containing fixed class assignments for particular points

Returns

  • classes (ndarray) – (n,) int array of class assignments (0..k-1) for each x in X

  • cost (float) – The total ‘cost’ of the assignment, which is the average distance of all points to their assigned centre
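Ignoring the distributed execution and the semi-supervised overrides, the core assignment can be sketched as:

import numpy as np
from scipy.spatial.distance import cdist

def compute_class_sketch(X, C):
    D2 = cdist(X, C, 'sqeuclidean')             # (n, k) squared distances
    classes = np.argmin(D2, axis=1)             # nearest centre per point
    cost = np.mean(np.sqrt(D2[np.arange(len(X)), classes]))
    return classes, cost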

uncoverml.cluster.compute_n_classes(classes, config)

The number of cluster centres to use for K-means

Just handles the case where someone specifies k=5 but labels 10 classes in the training data. This will return k=10.

Parameters
  • classes (ndarray) – an array of hard class assignments given as training data

  • config (Config) – The app config class holding the number of classes asked for

Returns

k – The max of k and the number of classes referenced in the training data

Return type

int > 0

uncoverml.cluster.compute_weights(x, C)

Number of points in x assigned to each centre c in C

Parameters
  • x (ndarray) – (n, d) array of n d-dimensional points

  • C (ndarray) – (k, d) array of k cluster centres

Returns

weights – (k,) length array giving number of x closest to each c in C

Return type

ndarray

uncoverml.cluster.initialise_centres(X, k, l, training_data=None, max_iterations=1000)

Use Kmeans|| to find initial cluster centres

This algorithm efficiently generates log(n) candidate samples, then uses k-means to cluster them into k initial starting centres used in the main algorithm (clustering X)

Parameters
  • X (ndarray) – (n,d) array of points to cluster

  • k (int > 0) – number of clusters

  • l (float > 0) – Oversample factor. See weighted_starting_candidates.

  • training_data (TrainingData (optional)) – Optional hard assignments of certain points in X

  • max_iterations (int > 0) – The algorithm will terminate after this many iterations even if it hasn’t converged.

Returns

C_init – (k, d) array of starting cluster centres for clustering X with k-means.

Return type

ndarray

uncoverml.cluster.kmean_distance2(x, C)

Compute squared euclidian distance to the nearest cluster centre

Parameters
  • x (ndarray) – (n, d) array of n d-dimensional points

  • C (ndarray) – (k, d) array of k cluster centres

Returns

d2_x – (n,) length array of distances from each x to the nearest centre

Return type

ndarray

uncoverml.cluster.kmeans_step(X, C, classes, weights=None)

A single step of the k-means algorithm.

Assigns every point in X a centre, then computes the centroid of all x assigned to each centre, then updates that centre to be the new centroid.

Parameters
  • X (ndarray) – (n, d) array of points to be clustered

  • C (ndarray) – (k, d) array of initial cluster centres

  • classes (ndarray) – (n,) array of initial class assignments

  • weights (ndarray (optional)) – weights for points x in X that allow for different ‘masses’ or repetitions in the centroid calculation

Returns

C_new – (k, d) array of new cluster centres

Return type

ndarray
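A local, non-distributed sketch of one step (note the real implementation re-seeds any cluster that loses all its members via reseed_point; this sketch does not):

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_step_sketch(X, C, weights=None):
    classes = np.argmin(cdist(X, C, 'sqeuclidean'), axis=1)
    C_new = np.empty_like(C)
    for j in range(C.shape[0]):
        members = classes == j                  # points assigned to centre j
        w = None if weights is None else weights[members]
        C_new[j] = np.average(X[members], axis=0, weights=w)
    return C_new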

uncoverml.cluster.log = <Logger uncoverml.cluster (WARNING)>

Never use more than this many x’s to compute a distance matrix (save memory!)

uncoverml.cluster.reseed_point(X, C, index)

Re-initialise the centre of a class if it loses all its members

This should almost never happen. If it does, find the point furthest from all the other cluster centres and use that. Maybe a bad idea but a decent first pass

Parameters
  • X (ndarray) – (n, d) array of points

  • C (ndarray) – (k, d) array of cluster centres

  • index (int >= 0) – index between 0..k-1 of the cluster that has lost its points

Returns

new_point – d-dimensional point for replacing the empty cluster centre.

Return type

ndarray

uncoverml.cluster.run_kmeans(X, C, k, weights=None, training_data=None, max_iterations=1000)

Cluster points into k clusters using K-means

This is a distributed implementation of Lloyd’s algorithm, which iteratively optimises the point assignments and cluster centres until it reaches a locally optimal solution. The result depends heavily on the initial cluster centres C

Parameters
  • X (ndarray) – (n, d) array n d-dimensional of points to cluster

  • C (ndarray) – (k, d) array of initial cluster centres

  • k (int > 0) – number of clusters

  • weights (ndarray (optional)) – (n,) array of optional repetition weights for points in X. A weight of 2 implies there are 2 points at that location

  • training_data (TrainingData (optional)) – An instance of the TrainingData class containing fixed cluster assignments for some of the x in X

  • max_iterations (int > 0 (optional)) – The algorithm will return after this many iterations, even if it hasn’t converged

Returns

  • C (ndarray) – (k, d) array of final cluster centres, ordered (0..k-1)

  • classes (ndarray) – (n,) array of class assignments (0..k-1) for each x in X
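Putting the pieces together, a hedged end-to-end sketch (data and values illustrative):

import numpy as np
from uncoverml import cluster

X = np.random.rand(1000, 4)
k, l = 5, 2.0
C_init = cluster.initialise_centres(X, k, l)
C, classes = cluster.run_kmeans(X, C_init, k)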

uncoverml.cluster.sum_axis_0(x, y, dtype)

Reduce operation that sums 2 arrays on axis zero

uncoverml.cluster.weighted_starting_candidates(X, k, l)

Generate (weighted) candidates to initialise the full k-means

See the kmeans|| algorithm/paper for details. The goal is to find points that are good starting cluster centres for a full kmeans using only log(n) passes through the data

Parameters
  • X (ndarray) – (n, d) array of n d-dimensional points to be clustered

  • k (int > 0) – number of clusters

  • l (float > 0) – The ‘oversample factor’ that controls how many candidates are found. Candidates are found independently on each node, so this can be smaller when the computation uses more nodes.

Returns

  • w (ndarray) – The ‘weights’ of the cluster centres, which are the number of points in X closest to each centre

  • C (ndarray) – The cluster centres themselves. The total number of candidates is not known beforehand, so the array will be shaped (z, d) where z is some number that increases with l.

uncoverml.config module

Handles parsing of the configuration file.

class uncoverml.config.Config(yaml_file, clustering=False, learning=False, resampling=False, predicting=False, shiftmap=True)

Bases: object

Class representing the global configuration of the uncoverml scripts.

This class is mostly read-only, but it does also contain the Transform objects which have state. In some execution paths, config flags are switched off then back on (e.g. in cross validation).

Along with the YAML file, the init also takes some flags. These are set by the top-level CLI scripts and are used to determine what parameters to load and what can be ignored.

All attributes following output_dir (located at the bottom of init) are undocumented but should be self-explanatory. They are full paths to output for different features.

Todo

Factor out stateful Transform objects.

Parameters
  • yaml_file (str) – The path to the yaml config file.

  • clustering (bool) – True if clustering.

  • learning (bool) – True if learning.

  • resampling (bool) – True if resampling.

  • predicting (bool) – True if predicting.
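A hedged instantiation sketch (‘demo.yaml’ is a hypothetical config path):

from uncoverml.config import Config

conf = Config('demo.yaml', learning=True)
print(conf.algorithm, conf.output_dir)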

name

Name of the config file.

Type

str

algorithm

Name of the model to train. See Models for available models.

Type

str

algorithm_args

A dictionary of arguments to pass to selected model. See Models for available arguments to model. Key is the argument name exactly as it appears in model __init__ (this dict gets passed as kwargs).

Type

dict(str, any)

cubist

True if cubist algorithm is being used.

Type

bool

multicubist

True if multicubist algorithm is being used.

Type

bool

multirandomforest

True if multirandomforest algorithm is being used.

Type

bool

krige

True if kriging is being used.

Type

bool

bootstrap

True if a bootstrapped algorithm is being used.

Type

bool

clustering

True if clustering is being performed.

Type

bool

n_classes

Number of classes to cluster into. Required if clustering.

Type

int

oversample_factor

Controls how many candidates are found for cluster initialisation when running kmeans clustering. See weighted_starting_candidates(). Required when clustering.

Type

float

cluster_analysis

True if analysis should be performed post-clustering. Optional, default is False.

Type

bool, optional

class_file

Define classes for clustering feature data. Path to shapefile that defines class at positions.

Type

str or bytes, optional

semi_supervised

True if semi_supervised clustering is being performed (i.e. class_file has been provided).

Type

bool

target_search

True if the target_search feature is being used.

Type

bool

target_search_threshold

Target search threshold, float between 0 and 1. The likelihood a training point must surpass to be included in found points.

Type

float

target_search_extents

A bounding box defining the image area to search for additional targets.

Type

tuple(float, float, float, float)

tse_are_pixel_coordinates

If True, target_search_extents are treated as pixel coordinates instead of CRS coordinates.

Type

bool

extents

A bounding box defining the area to learn and predict on. Data outside these extents gets cropped. Optional, if not provided whole image area is used.

Type

tuple(float, float, float, float), optional

extents_are_pixel_coordinates

If True, extents are treated as pixel coordinates instead of CRS coordinates.

Type

bool

pk_covariates

Path to where to save pickled covariates, or a pre-existing covariate pickle file if loading pickled covariates.

Type

str or bytes

pk_targets

Path to where to save pickled targets, or a pre-existing target pickle file if loading pickled targets.

Type

str or bytes

pk_load

True if both pk_covariates and pk_targets are provided and these paths exist (it’s assumed they contain the correct pickled data).

Type

bool

feature_sets

The provided features as FeatureSetConfig objects. These contain paths to the feature files and importantly the Transform objects which contain statistics used to transform the covariates. These Transform objects and contained statistics must be maintained across workflow steps (aka CLI commands).

Type

list of FeatureSetConfig

patchsize

Half-width of the patches that feature data will be chunked into. Height/width of each patch is equal to patchsize * 2 + 1.

Todo

Not implemented, defaults to 1.

Type

int

target_file

Path to a shapefile defining the targets to be trained on.

Type

str or bytes

target_property

Name of the field in the target_file to be used as training property.

Type

str

target_weight_property

Name of the field in the target_file to be used as target weights.

Type

str, optional

fields_to_write_to_csv

List of field names in the target_file to be included in output table.

Type

list(str), optional

shiftmap_targets

Path to a shapefile containing targets to generate the shiftmap from. This is optional; by default, shiftmap will generate dummy targets by randomly sampling the target shapefile.

Type

str or bytes, optional

spatial_resampling_args

Kwargs for spatial resampling. See Resampling for more details.

Type

dict

value_resampling_args

Kwargs for value resampling. See Resampling for more details.

Type

dict

final_transform

Transforms to apply to whole image set after other preprocessing has been performed.

Type

TransformSet

oos_percentage

Float between 0 and 1. The percentage of targets to withhold from training to be used in out-of-sample validation.

Type

float, optional

oos_shapefile

Shapefile containing targets to be used in out-of-sample validation.

Type

str or bytes, optional

oos_property

Name of the property in oos_shapefile to be used in validation. Only required if an OOS shapefile is provided.

Type

str

out_of_sample_validation

True if out of sample validation is to be performed.

Type

bool

rank_features

True if ‘feature_ranking’ is True in ‘validation’ block of the config. Turns on feature ranking. Default is False.

Type

bool, optional

permutation_importance

True if ‘permutation_importance’ is True in ‘validation’ block of the config. Turns on permutation importance. Default is False.

Type

bool

parallel_validate

True if ‘parallel’ is present in ‘k-fold’ block of config. Turns on parallel k-fold cross validation. Default is False.

Type

bool, optional

cross_validate

True if ‘k-fold’ block is present in ‘validation’ block of config. Turns on k-fold cross validation.

Type

bool, optional

folds

The number of folds to split dataset into for cross validation. Required if cross_validate is True.

Type

int

crossval_seed

Seed for random sorting of folds for cross validation. Required if cross_validate is True.

Type

int

optimisation

Dictionary of optimisation arguments. See Optimisation for details.

Type

dict

geotiff_options

Optional creation options passed to the geotiff output driver. See https://gdal.org/drivers/raster/gtiff.html#creation-options for a list of creation options.

Type

dict, optional

quantiles

Prediction quantile/interval for predicted values.

Type

float

outbands

The outbands to write in the prediction output file. Used as the ‘stop’ for a slice taken from the list of prediction tags, i.e. [0: outbands]. If the resulting slice is greater than the number of tags available, then all tags will be selected. If no value is provided, then all tags will be selected.

Todo

Having this as a slice is questionable. Should be simplified.

Type

int

thumbnails

Subsampling factor for thumbnails of output images. Default is 10.

Type

int, optional

bootstrap_predictions

Only applies if a bootstrapped algorithm is being used. This is the number of predictions to perform; by default it will predict on all sub-models. E.g. if you had a bootstrapped algorithm containing 100 sub-models, you could limit a test prediction to 20 using this parameter to speed things up.

Type

int, optional

mask

Path to a geotiff file for masking the output prediction map. Only cells marked with the retain value will be predicted.

Type

str, optional

retain

Value in the above mask that indicates a cell should be retained and predicted. Must be provided if a mask is provided.

Type

int

lon_lat

Dictionary containing paths to longitude and latitude grids used in kriging.

Type

dict, optional

output_dir

Path to directory where prediction map and other outputs will be written.

Type

str

static parse_extents(exb)

Validates extents parameters.

set_algo_flags()

Convenience method for setting boolean flags based on the algorithm being used.

property tmpdir

Convenience property for creating the tmpdir needed by some UncoverML functionality.

yaml_loader

alias of yaml.loader.SafeLoader

exception uncoverml.config.ConfigException

Bases: Exception

class uncoverml.config.FeatureSetConfig(config_dict)

Bases: object

Config class representing a ‘feature set’ in the config file.

Parameters

config_dict (dict) – The section of the yaml file for a feature set.

name

Name of the feature set.

Type

str

type

Data type of the feature set (‘categorical’ or ‘ordinal’).

Type

str

files

Absolute paths to .tif files of the feature set.

Type

list of str

transform_set

Transforms specified for the feature set.

Type

ImageTransformSet

uncoverml.cubist module

class uncoverml.cubist.Cubist(name='temp', print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, sampling=None, seed=None, neighbors=None, feature_type=None, composite_model=False, auto=False, extrapolation=None, calc_usage=False, bootstrap=None)

Bases: object

This class wraps the cubist command line tools in a scikit-learn interface. The learning phase relies on the cubist command line tools, whereas the predictions themselves are executed directly in python.

fit(x, y)

Train the Cubist model. Given a matrix of values (X) and an output vector of values (y), this method will train the Cubist model and then read the training files directly as parameters of this class.

Parameters
  • x (numpy.array) – X contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, where n is the number of available training points.

  • y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.

predict(x)

Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.

Parameters

x (numpy.array) – The inputs for which the model should be evaluated

Returns

y_mean – An array of expected output values given the inputs

Return type

numpy.array

predict_dist(x, interval=0.95)

Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in X. This method also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.

This method also outputs quantile information along with the variance to establish the probability distribution clearly.

Parameters
  • x (numpy.array) – The inputs for which the model should be evaluated

  • interval (float) – The probability threshold for which the quantiles should be output.

Returns

  • y_mean (numpy.array) – An array of expected output values given the inputs

  • y_var (numpy.array) – The variance of the outputs

  • ql (numpy.array) – The lower quantiles for each input

  • qu (numpy.array) – The upper quantiles for each input
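A usage sketch, assuming the Cubist command line tools are installed and on the PATH (data illustrative):

import numpy as np
from uncoverml.cubist import Cubist

x = np.random.rand(100, 3)
y = np.random.rand(100)
model = Cubist(committee_members=5)
model.fit(x, y)
y_mean, y_var, ql, qu = model.predict_dist(x, interval=0.95)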

class uncoverml.cubist.CubistReportRow(cond, model, feature)

Bases: object

Convenience class for accumulating the Cubist report.

class uncoverml.cubist.MultiCubist(outdir='.', trees=10, print_output=False, unbiased=True, max_rules=None, committee_members=1, max_categories=5000, neighbors=None, feature_type=None, sampling=70, seed=None, extrapolation=None, composite_model=False, auto=False, parallel=False, calc_usage=False, bootstrap=None)

Bases: object

This is a wrapper on Cubist.

calculate_usage()

Averages the Cond and Model statistics of all the cubist runs

fit(x, y)

Train the Cubist model. Given a matrix of values (X) and an output vector of values (y), this method will train the Cubist model and then read the training files directly as parameters of this class.

Parameters
  • x (numpy.array) – X contains all of the training inputs. This should be a matrix of values, where x.shape[0] = n, where n is the number of available training points.

  • y (numpy.array) – y contains the output target variables for each corresponding input vector. Again we expect y.shape[0] = n.

predict(x)

Predicts the y values that correspond to each input. Just like predict_dist, this predicts the output value, given a list of inputs contained in x.

Parameters

x (numpy.array) – The inputs for which the model should be evaluated

Returns

y_mean – An array of expected output values given the inputs

Return type

numpy.array

predict_dist(x, interval=0.95)

Predict the outputs and variances of the inputs. This method predicts the output values that would correspond to each input in X. This method also returns the certainty of the model in each case, which is only sensible when the number of committee members is greater than one.

This method also outputs quantile information along with the variance to establish the probability distribution clearly.

Parameters
  • x (numpy.array) – The inputs for which the model should be evaluated

  • interval (float) – The probability threshold for which the quantiles should be output.

Returns

  • y_mean (numpy.array) – An array of expected output values given the inputs

  • y_var (numpy.array) – The variance of the outputs

  • ql (numpy.array) – The lower quantiles for each input

  • qu (numpy.array) – The upper quantiles for each input

class uncoverml.cubist.Rule(rule, m)

Bases: object

comparator = {'<': <ufunc 'less'>, '<=': <ufunc 'less_equal'>, '=': <ufunc 'equal'>, '>': <ufunc 'greater'>, '>=': <ufunc 'greater_equal'>}
regress(x, mask=None)
satisfied(x)
uncoverml.cubist.arguments(p)
uncoverml.cubist.cond_line(line)
uncoverml.cubist.mean(numbers)
uncoverml.cubist.pairwise(iterable)
uncoverml.cubist.parse_float_array(arraystring)
uncoverml.cubist.read_data(filename)
uncoverml.cubist.remove_first_line(line)
uncoverml.cubist.save_data(filename, data)
uncoverml.cubist.variance_with_mean(mean)
uncoverml.cubist.write_dict(filename, dict_obj)

uncoverml.cubist_config module

uncoverml.diagnostics module

This module contains functionality for plotting validation scores and other diagnostic information.

uncoverml.diagnostics.plot_covariate_correlation(path, method='pearson')

Plots matrix of correlation between covariates.

Parameters
  • path (str) – Path to ‘rawcovariates’ CSV file.

  • method (str, optional) – Correlation coefficient to calculate. Choices are ‘pearson’, ‘kendall’, ‘spearman’. Default is ‘pearson’.

Returns

The matrix plot as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure
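A usage sketch (the CSV path is hypothetical; it is the ‘<config name>_rawcovariates.csv’ file written during learning):

from uncoverml import diagnostics

fig = diagnostics.plot_covariate_correlation('out/demo_rawcovariates.csv')
fig.savefig('covariate_correlation.png')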

uncoverml.diagnostics.plot_covariates_x_targets(path, cols=2, subplot_width=8, subplot_height=4)

Plots scatter plots of each covariate intersected with target values.

Parameters
  • path (str) – Path to ‘rawcovariates’ CSV file containing intersection of targets and covariates.

  • cols (int, optional) – The number of columns to split the figure into. Default is 2.

  • subplot_width (int) – Width of each subplot in inches. Default is 8.

  • subplot_height (int) – Height of each subplot in inches. Default is 4.

Returns

The scatter plots as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_feature_rank_curves(path, subplot_width=8, subplot_height=4)

Plots curves for feature ranking of each metric.

Parameters
  • path (str) – Path to ‘featureranks’ JSON file.

  • subplot_width (int, optional) – Width of each subplot. Default is 8.

  • subplot_height (int, optional) – Height of each subplot. Default is 4.

Returns

The plots as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_feature_ranks(path, barwidth=0.08, figsize=(15, 9))

Plots a grouped bar chart of feature rank scores, grouped by covariate. Depending on the number of covariates and metrics being calculated you may need to tweak barwidth and figsize so the bars fit.

Parameters
  • path (str) – Path to JSON file containing feature ranking results.

  • barwidth (float, optional) – Width of the bars.

  • figsize (tuple(float, float), optional) – The (width, height) of the figure in inches.

Returns

The bar chart as a matplotlib Figure.

Return type

obj:matplotlib.figure.Figure

uncoverml.diagnostics.plot_real_vs_pred_crossval(crossval_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)
uncoverml.diagnostics.plot_real_vs_pred_prediction(rc_path, pred_path, scores_path=None, bins=20, overlay=False, hist_cm=None, scatter_color=None, figsize=(25, 12.5), point_size=None)
uncoverml.diagnostics.plot_residual_error_crossval(crossval_path, bins=20)
uncoverml.diagnostics.plot_residual_error_prediction(rc_path, pred_path, bins=20)
uncoverml.diagnostics.plot_target_scaling(path, bins=20, title='Target Scaling', sharey=False)

Plots histograms of target values pre and post-scaling.

Parameters
  • path (str) – Path to ‘transformed_targets’ CSV file.

  • bins (int, optional) – The number of value bins for the histograms. Default is 20.

  • title (str, optional) – The title of the plot. Defaults to ‘Target Scaling’.

  • sharey (bool) – Whether the plots will share a y-axis and scale. Default is False.

Returns

The histograms as a matplotlib Figure.

Return type

obj:maplotlib.figure.Figure

uncoverml.features module

uncoverml.features.cull_all_null_rows(feature_sets)
uncoverml.features.extract_features(image_source, targets, n_subchunks, patchsize)

Each node gets its own share of the targets, so all nodes will always have targets.

uncoverml.features.extract_subchunks(image_source, subchunk_index, n_subchunks, patchsize)
uncoverml.features.features_from_shapefile(feature_sets, mask=None)
uncoverml.features.gather_features(x, node=None)
uncoverml.features.intersect_shapefile_features(targets, feature_sets, target_drop_values)

Extract covariates from a shapefile. This is done by intersecting targets with the shapefile. The shapefile must have the same number of rows as there are targets.

Drop target values here for tabular predictions. This is mainly for convenience if there are classes or points in the target file that we don’t want to predict on for whatever reason (e.g. out-of-sample validation purposes). It’s done here rather than when targets are first loaded so we don’t also have to handle a mask + targets being returned from target loading as the mask won’t be required in most situations.

Parameters
  • targets (uncoverml.targets.Targets) – An uncoverml.targets.Targets object that has been loaded from a shapefile.

  • feature_sets (list of uncoverml.config.FeatureSetConfig) – A list of feature sets of ‘tabular’ type (sourced from shapefiles). Each set must have an attribute file that points to the shapefile to load and attribute fields which is the list of fields to retrieve as covariates from the file.

  • target_drop_values (list of any) – A list of values; any row whose target observation equals one of these values is dropped and won’t be intersected with the covariates.

uncoverml.features.remove_missing(x, targets=None)
uncoverml.features.save_intersected_features_and_targets(feature_sets, transform_sets, targets, config, impute=True)

This function saves a table of covariate values and the target value intersected at each point. It also contains columns for UID ‘index’ and a predicted value.

If the target shapefile contains an ‘index’ field, this will be used to populate the ‘index’ column. This is intended to be used as a unique ID for each point in post-processing. If no ‘index’ field exists this column will be zero filled.

The ‘prediction’ column is for predicted values created during cross-validation. Again, this is for post-processing. It will only be populated if cross-validation is run later on. If not, it will be zero filled.

Two files will be output:

…/output_dir/{name_of_config}_rawcovariates.csv

…/output_dir/{name_of_config}_rawcovariates_mask.csv

This function will also optionally output intersected covariates scatter plot and covariate correlation matrix plot.

uncoverml.features.transform_features(feature_sets, transform_sets, final_transform, config)

uncoverml.filtering module

Code for computing the gamma sensor footprint, and for applying and unapplying spatial convolution filters to a given image.

BM: this is used in scripts/gammasensor_cli.py - I haven’t used it in my time with uncoverml or seen it used.

uncoverml.filtering.fwd_filter(img, S)
uncoverml.filtering.inv_filter(img, S, noise=0.001)
uncoverml.filtering.kernel_impute(img, S)
uncoverml.filtering.pad2(img)
uncoverml.filtering.sensor_footprint(img_w, img_h, res_x, res_y, height, mu_air)

uncoverml.geoio module

class uncoverml.geoio.ArrayImageSource(A, origin, crs, pixsize)

Bases: uncoverml.geoio.ImageSource

An image source that uses an internally stored numpy array

Parameters
  • A (MaskedArray) – masked array of shape (xpix, ypix, channels) that contains the image data.

  • origin (ndarray) – Array of the form [lonmin, latmin] that defines the origin of the image

  • pixsize (ndarray) – Array of the form [pixsize_x, pixsize_y] defining the size of a pixel

data(min_x, max_x, min_y, max_y)
class uncoverml.geoio.ImageSource

Bases: object

property crs
abstract data(min_x, max_x, min_y, max_y)
property dtype
property full_resolution
property nodata_value
property origin_latitude
property origin_longitude
property pixsize_x
property pixsize_y
class uncoverml.geoio.ImageWriter(shape, bbox, crs, n_subchunks, outpath, outbands, band_tags=None, independent=False, **kwargs)

Bases: object

close()
nodata_value = array(-1.e+20, dtype=float32)
output_thumbnails(ratio=10)
write(x, subchunk_index)
Parameters
  • x

  • subchunk_index

  • independent (bool) – Independent image writing by different processes, i.e., images are not chunked.


class uncoverml.geoio.RasterioImageSource(filename)

Bases: uncoverml.geoio.ImageSource

data(min_x, max_x, min_y, max_y)
uncoverml.geoio.SharedTrainingData

alias of uncoverml.geoio.TrainingData

uncoverml.geoio.create_shared_training_data(targets_all, x_all)
uncoverml.geoio.crop_covariates(config, outdir=None)

Crops the covariate files listed under config.feature_sets using the bounds provided under config.extents. The cropped covariates are stored in a temporary directory and the paths in config.feature_sets are redirected to these files. The caller is responsible for removing the files once they are no longer needed.

Parameters
  • config (uncoverml.config.Config) – Parsed UncoverML config.

  • outdir (str) – Absolute path to directory to store cropped covariates. If not provided, a tmp directory will be created.

uncoverml.geoio.crop_mask(config, outdir=None)

Crops the prediction mask listed under config.mask.

uncoverml.geoio.crop_tif(filename, extents, pixel_coordinates=False, outfile=None, strict=False)

Crops the geotiff using the provided extent.

Parameters
  • filename (str) – Path to the geotiff to be cropped.

  • extents (tuple(float, float, float, float)) – Bounding box to crop by, ordering is (xmin, ymin, xmax, ymax). Data outside bounds will be cropped. Any elements that are None will be substituted with the original bound of the geotiff.

  • outfile (str) – Path to save cropped geotiff. If not provided, will be saved with original name + random id in tmp directory.
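A usage sketch (paths and bounds are illustrative):

from uncoverml.geoio import crop_tif

# (xmin, ymin, xmax, ymax) in CRS coordinates; None keeps an original bound
crop_tif('covariate.tif', (130.0, -32.0, 132.0, -30.0),
         outfile='covariate_cropped.tif')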

uncoverml.geoio.deallocate_shared_training_data(training_data)
uncoverml.geoio.distribute_targets(positions, observations, fields)

Distributes a target object across all nodes

uncoverml.geoio.export_feature_ranks(measures, feats, scores, config)
uncoverml.geoio.export_model(model, config)
uncoverml.geoio.feature_names(config)
uncoverml.geoio.get_image_bounds(config)
uncoverml.geoio.get_image_crs(config)
uncoverml.geoio.get_image_pixel_res(config)
uncoverml.geoio.get_image_spec(model, config)
uncoverml.geoio.image_feature_sets(targets, config)
uncoverml.geoio.image_resolutions(config)
uncoverml.geoio.image_subchunks(subchunk_index, config)
uncoverml.geoio.load_shapefile(filename, targetfield, covariate_crs, extents)

TODO

uncoverml.geoio.load_targets(shapefile, targetfield=None, covariate_crs=None, extents=None)

Loads the shapefile onto node 0 then distributes it across all available nodes.

Important: here the concatenated targets get sorted on the root processor by position (Y,X). It’s important that this order is preserved. Once covariates are intersected with the target data, they are also in this order. This ordering is what keeps the target and feature arrays synced.

uncoverml.geoio.resample(input_tif, output_tif, ratio, resampling=5)
Parameters
  • input_tif (str or rasterio.io.DatasetReader) – input file path or rasterio.io.DatasetReader object

  • output_tif (str) – output file path

  • ratio (float) – ratio by which to shrink/expand ratio > 1 means shrink

  • resampling (int, optional) – default is 5 (average) resampling. Other options: nearest = 0, bilinear = 1, cubic = 2, cubic_spline = 3, lanczos = 4, average = 5, mode = 6, gauss = 7, max = 8, min = 9, med = 10, q1 = 11, q3 = 12.
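A usage sketch (paths illustrative):

from uncoverml.geoio import resample

# ratio=2 shrinks the image; resampling=5 is 'average'
resample('covariate.tif', 'covariate_small.tif', ratio=2, resampling=5)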

uncoverml.geoio.semisupervised_feature_sets(targets, config)
uncoverml.geoio.unsupervised_feature_sets(config)
uncoverml.geoio.write_shapefile_prediction(pred, pred_tags, positions, config)

uncoverml.image module

Contains class and routines for reading chunked portions of images.

class uncoverml.image.Image(source, chunk_idx=0, nchunks=1, overlap=0)

Bases: object

Represents a raster Image. Can be used to get a georeferenced chunk of an Image and the data associated with it. This class is mainly used in the features module for intersecting image chunks with target data and extracting the image data. It’s also used in geoio for getting covariate specs, such as CRS and bounds.

If nchunks > 1, then the Image is striped horizontally. chunk_idx 0 is the first strip of the image. The X range covers the full width of the image and the Y range runs from 0 to image_height / nchunks.

Parameters
  • source (ImageSource) – An instance of ImageSource (typically RasterioImageSource). Defines the image to be loaded.

  • chunk_idx (int) – Which chunk of the image is being loaded.

  • nchunks (int) – Total number of chunks being used. This is typically set by the partitions parameter of the top level command, also set as n_subchunks on the Config object.

  • overlap (int) – Number of rows to overlap with bounding strips, for accommodating overlap between chunks. Doesn’t appear to actually be used.
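A usage sketch (the file path is illustrative):

from uncoverml.geoio import RasterioImageSource
from uncoverml.image import Image

src = RasterioImageSource('covariate.tif')
img = Image(src, chunk_idx=0, nchunks=4)  # first of four horizontal strips
chunk = img.data()                        # masked array for this strip
print(img.xmin, img.ymin, img.xmax, img.ymax)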

property channels
data()
property dtype
in_bounds(lonlat)
lonlat2pix(lonlat)
property nodata_value
property npoints
patched_bbox(patchsize)
patched_shape(patchsize)
pix2lonlat(xy)
property x_range
property xmax
property xmin
property xres
property y_range
property ymax
property ymin
property yres
uncoverml.image.bbox2affine(xmax, xmin, ymax, ymin, xres, yres)
uncoverml.image.construct_splits(npixels, nchunks, overlap=0)

Splits the image horizontally into approximately equal strips according to npixels / nchunks.

uncoverml.interpolate module

class uncoverml.interpolate.SKLearnCT(fill_value=0, rescale=False, maxiter=1000, tol=0.0001)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.CloughTocher2DInterpolator class.

CloughTocher2DInterpolator(points, values, tol=1e-6)

Piecewise cubic, C1 smooth, curvature-minimizing interpolant in 2D.

New in version 0.9.

__call__()
Parameters
  • points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.

  • values (ndarray of float or complex, shape (npoints, ..)) – Data values.

  • fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is nan.

  • tol (float, optional) – Absolute/relative tolerance for gradient estimation.

  • maxiter (int, optional) – Maximum number of iterations in gradient estimation.

  • rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

Notes

The interpolant is constructed by triangulating the input data with Qhull [1]_, and constructing a piecewise cubic interpolating Bezier polynomial on each triangle, using a Clough-Tocher scheme [CT]. The interpolant is guaranteed to be continuously differentiable.

The gradients of the interpolant are chosen so that the curvature of the interpolating surface is approximatively minimized. The gradients necessary for this are estimated using the global algorithm described in [Nielson83,Renka84]_.

References

1

http://www.qhull.org/

CT

See, for example, P. Alfeld, ‘’A trivariate Clough-Tocher scheme for tetrahedral data’’. Computer Aided Geometric Design, 1, 169 (1984); G. Farin, ‘’Triangular Bernstein-Bezier patches’’. Computer Aided Geometric Design, 3, 83 (1986).

Nielson83

G. Nielson, ‘’A method for interpolating scattered data based upon a minimum norm network’’. Math. Comp., 40, 253 (1983).

Renka84

R. J. Renka and A. K. Cline. ‘’A Triangle-based C1 interpolation method.’’, Rocky Mountain J. Math., 14, 223 (1984).

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnLinearNDInterpolator(fill_value=0, rescale=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.LinearNDInterpolator class.

LinearNDInterpolator(points, values, fill_value=np.nan, rescale=False)

Piecewise linear interpolant in N dimensions.

New in version 0.9.

__call__()
Parameters
  • points (ndarray of floats, shape (npoints, ndims); or Delaunay) – Data point coordinates, or a precomputed Delaunay triangulation.

  • values (ndarray of float or complex, shape (npoints, ..)) – Data values.

  • fill_value (float, optional) – Value used to fill in for requested points outside of the convex hull of the input points. If not provided, then the default is nan.

  • rescale (bool, optional) – Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

Notes

The interpolant is constructed by triangulating the input data with Qhull [1]_, and on each triangle performing linear barycentric interpolation.

References

1

http://www.qhull.org/

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnNearestNDInterpolator(rescale=False, tree_options=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for the scipy.interpolate.NearestNDInterpolator class.

NearestNDInterpolator(x, y)

Nearest-neighbour interpolation in N dimensions.

New in version 0.9.

__call__()
Parameters
  • x ((Npoints, Ndims) ndarray of floats) – Data point coordinates.

  • y ((Npoints,) ndarray of float or complex) – Data values.

  • rescale (boolean, optional) –

    Rescale points to unit cube before performing interpolation. This is useful if some of the input dimensions have incommensurable units and differ by many orders of magnitude.

    New in version 0.14.0.

  • tree_options (dict, optional) –

    Options passed to the underlying cKDTree.

    New in version 0.17.0.

Notes

Uses scipy.spatial.cKDTree

fit(X, y)
predict(X)
class uncoverml.interpolate.SKLearnRbf(function='multiquadric', smooth=0, norm='euclidean')

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Scikit-learn wrapper for scipy.interpolate.Rbf class.

Rbf(*args)

A class for radial basis function approximation/interpolation of n-dimensional scattered data.

Parameters
  • *args (arrays) – x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes

  • function (str or callable, optional) –

    The radial basis function, based on the radius, r, given by the norm (default is Euclidean distance); the default is ‘multiquadric’:

    'multiquadric': sqrt((r/self.epsilon)**2 + 1)
    'inverse': 1.0/sqrt((r/self.epsilon)**2 + 1)
    'gaussian': exp(-(r/self.epsilon)**2)
    'linear': r
    'cubic': r**3
    'quintic': r**5
    'thin_plate': r**2 * log(r)
    

    If callable, then it must take 2 arguments (self, r). The epsilon parameter will be available as self.epsilon. Other keyword arguments passed in will be available as well.

  • epsilon (float, optional) – Adjustable constant for gaussian or multiquadrics functions - defaults to approximate average distance between nodes (which is a good start).

  • smooth (float, optional) – Values greater than zero increase the smoothness of the approximation. 0 is for interpolation (default), the function will always go through the nodal points in this case.

  • norm (str, callable, optional) – A function that returns the ‘distance’ between two points, with inputs as arrays of positions (x, y, z, …), and an output as an array of distance. E.g., the default: ‘euclidean’, such that the result is a matrix of the distances from each point in x1 to each point in x2. For more options, see the documentation of scipy.spatial.distance.cdist.

N

The number of data points (as determined by the input arrays).

Type

int

di

The 1-D array of data values at each of the data coordinates xi.

Type

ndarray

xi

The 2-D array of data coordinates.

Type

ndarray

function

The radial basis function. See description under Parameters.

Type

str or callable

epsilon

Parameter used by gaussian or multiquadrics functions. See Parameters.

Type

float

smooth

Smoothing parameter. See description under Parameters.

Type

float

norm

The distance function. See description under Parameters.

Type

str or callable

nodes

A 1-D array of node values for the interpolation.

Type

ndarray

A
Type

internal property, do not use

Examples

>>> import numpy as np
>>> from scipy.interpolate import Rbf
>>> x, y, z, d = np.random.rand(4, 50)
>>> rbfi = Rbf(x, y, z, d)  # radial basis function interpolator instance
>>> xi = yi = zi = np.linspace(0, 1, 20)
>>> di = rbfi(xi, yi, zi)   # interpolated values
>>> di.shape
(20,)
fit(X, y)
predict(X)

uncoverml.krige module

class uncoverml.krige.Krige(method='ordinary', variogram_model='linear', nlags=6, weight=False, n_closest_points=10, verbose=False)

Bases: uncoverml.models.TagsMixin, sklearn.base.RegressorMixin, sklearn.base.BaseEstimator, uncoverml.krige.KrigePredictDistMixin

A scikit-learn wrapper class for Ordinary and Universal Kriging. This works with both GridSearchCV/RandomizedSearchCV for optimising the Krige parameters.

fit(x, y, *args, **kwargs)
Parameters
  • x (ndarray) – array of Points, (x, y) pairs

  • y (ndarray) – array of targets

predict(x, *args, **kwargs)
Parameters

x (ndarray) – array of points, (x, y) pairs

Returns

Prediction array
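A usage sketch (data illustrative):

import numpy as np
from uncoverml.krige import Krige

lon_lat = np.random.rand(50, 2)           # (x, y) positions
y = np.random.rand(50)
model = Krige(method='ordinary', variogram_model='linear')
model.fit(lon_lat, y)
pred = model.predict(lon_lat)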

class uncoverml.krige.KrigePredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the Krige class.

This is especially for use with PyKrige Ordinary/UniversalKriging classes.

predict_dist(x, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • x (ndarray) – (Ns, 2) array query dataset (Ns samples, 2 dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

Returns

  • prediction (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • variance (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.krige.MLKrige(ml_method, ml_params={}, *args, **kwargs)

Bases: object

class uncoverml.krige.MLKrigeBase(ml_method, ml_params={}, method='ordinary', variogram_model='linear', n_closest_points=10, nlags=6, weight=False, verbose=False)

Bases: uncoverml.models.TagsMixin

This is an implementation of Regression-Kriging as described here: https://en.wikipedia.org/wiki/Regression-Kriging

fit(x, y, lon_lat, *args, **kwargs)

Fit the ML method and also Krige the residual.

Parameters
  • x (ndarray) – (Nt, d) array query dataset (Ns samples, d dimensions) for ML regression

  • y (ndarray) – array of targets (Nt, )

  • lon_lat – ndarray of (x, y) points. Needs to be a (Nt, 2) array corresponding to the lon/lat, for example.

krige_residual(lon_lat)
Parameters

lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

Returns

residual: ndarray

kriged residual values

ml_prediction(x, *args, **kwargs)
Parameters

x (ndarray) – regression matrix

Returns

ndarray – machine learning prediction

predict(x, lon_lat, *args, **kwargs)

Must override predict_dist method of Krige. Predictive mean and variance for a probabilistic regressor.

Parameters
  • x (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression

  • lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

Returns

pred – The expected value of ys for the query inputs, X of shape (Ns,).

Return type

ndarray

score(x, y, lon_lat, sample_weight=None)

Overloading default regression score method
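In outline, regression kriging fits the ML model on the covariates, kriges the residuals against position, and sums the two at prediction time. A hedged sketch (the ml_method value 'randomforest' is a hypothetical example):

import numpy as np
from uncoverml.krige import MLKrigeBase

x = np.random.rand(100, 5)                # covariates
y = np.random.rand(100)                   # targets
lon_lat = np.random.rand(100, 2)          # positions
model = MLKrigeBase(ml_method='randomforest', method='ordinary')
model.fit(x, y, lon_lat)                  # fit ML, then krige the residual
pred = model.predict(x, lon_lat)          # ML prediction + kriged residual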

class uncoverml.krige.MLKrigePredictDistMixin

Bases: object

predict_dist(x, interval=0.95, lon_lat=None, *args, **kwargs)

Predictive mean, variance, lower and upper quantile for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions) for ML regression

  • lon_lat – ndarray of (x, y) points. Needs to be a (Ns, 2) array corresponding to the lon/lat, for example.

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • kwargs – must contain a key lon_lat, which needs to be a (Ns, 2) array corresponding to the lon/lat.

Returns

  • pred (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • var (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.krige.MLKrigePreidctDist(*args, **kwargs)

Bases: uncoverml.krige.MLKrigeBase, uncoverml.krige.MLKrigePredictDistMixin

uncoverml.learn module

Handles calling learning methods on models.

uncoverml.learn.local_learn_model(x_all, targets_all, config)

Trains a model. Handles special case of parallel models.

Parameters
  • x_all (np.ndarray) – All covariate data, shape (n_samples, n_features), sorted using X, Y of target positions.

  • targets_all (np.ndarray) – All target data, shape (n_samples), sorted using X, Y of target positions.

  • config (Config) – Config object.

Returns

A trained Model.

Return type

Model

uncoverml.likelihoods module

Likelihood functions that can be used with revrand.

Can be used with revrand’s GeneralisedLinearModel class for specialised regression tasks such as basement depth estimation from censored and uncensored depth observations.

class uncoverml.likelihoods.Switching(lenscale=1.0, var_init=Parameter(value=1.0, bounds=Positive(upper=None), shape=()))

Bases: revrand.likelihoods.Bernoulli

Ey(f, var, z)

Expected value of the Bernoulli likelihood.

Parameters

f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).

Return type

ndarray

cdf(y, f, var, z)

Cumulative density function of the likelihood.

Parameters
  • y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

cdf – Cumulative density function evaluated at y.

Return type

ndarray

df(y, f, var, z)

Derivative of Bernoulli log likelihood w.r.t. f.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

df – the derivative \(\partial \log p(y|f) / \partial f\)

Return type

ndarray

dp(y, f, var, z)

Derivative of Bernoulli log likelihood w.r.t. the parameters, \(\theta\).

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

dp – the derivative \(\partial \log p(y|f, \theta)/ \partial \theta\) for each parameter. If there is only one parameter, this is not a list.

Return type

list, float or ndarray

loglike(y, f, var, z)

Bernoulli log likelihood.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

logp – the log likelihood of each y given each f under this likelihood.

Return type

ndarray

class uncoverml.likelihoods.UnifGauss(lenscale=1.0)

Bases: revrand.likelihoods.Bernoulli

Ey(f)

Expected value of the Bernoulli likelihood.

Parameters

f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

Ey – expected value of y, \(\mathbb{E}[\mathbf{y}|\mathbf{f}]\).

Return type

ndarray

cdf(y, f)

Cumulative density function of the likelihood.

Parameters
  • y (ndarray) – query quantiles, i.e. \(P(Y \leq y)\).

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

cdf – Cumulative density function evaluated at y.

Return type

ndarray

df(y, f)

Derivative of Bernoulli log likelihood w.r.t. f.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

df – the derivative \(\partial \log p(y|f) / \partial f\)

Return type

ndarray

loglike(y, f)

Bernoulli log likelihood.

Parameters
  • y (ndarray) – array of 0, 1 valued integers of targets

  • f (ndarray) – latent function from the GLM prior (\(\mathbf{f} = \boldsymbol\Phi \mathbf{w}\))

Returns

logp – the log likelihood of each y given each f under this likelihood.

Return type

ndarray

pdf(y, f)

uncoverml.metadata_profiler module

Description:

Gather Metadata for the uncover-ml prediction output results:

Reference: email 2019-05-24 Overview Creator: (person who generated the model) Model;

Name: Type and date: Algorithm: Extent: Lat/long - location on Australia map?

SB Notes: None of the above is required as this information will be captured in the yaml file.

Model inputs:

  1. Covariates - list (in full)

  2. Targets: path to shapefile: csv file

SB Notes: Only covariate list file. Targets and path to shapefile are not required as this is available in the yaml file. Maybe the full path to the shapefile has some merit as one can specify a partial path.

Model performance

JSON file (in full)

SB Notes: Yes

Model outputs

  1. Prediction grid including path

  2. Quantiles Q5; Q95

  3. Variance:

  4. Entropy:

  5. Feature rank file

  6. Raw covariates file (target value - covariate value)

  7. Optimisation output

  8. Others ??

SB Notes: Not required as these are model dependent, and the metadata will be contained in each of the output geotiff files.

Model parameters:

  1. YAML file (in full)

  2. .SH file (in full)

SB Notes: The .sh file is not required. The YAML file is read as a python dictionary in uncoverml, which can be dumped in the metadata.

CreationDate: 31/05/19 Developer: fei.zhang@ga.gov.au

Revision History:

LastUpdate: 31/05/19 FZ LastUpdate: dd/mm/yyyy Who Optional description

class uncoverml.metadata_profiler.MetadataSummary(model, config)

Bases: object

Summary Description of the ML prediction output

write_metadata(out_filename)

Write the metadata for this prediction result into a human-readable txt file, in order to make the ML results traceable and reproducible (provenance).

uncoverml.mllog module

Logging config.

class uncoverml.mllog.ElapsedFormatter

Bases: object

format(record)
class uncoverml.mllog.MPIStreamHandler(stream=None)

Bases: logging.StreamHandler

If a message starts with ‘:mpi:’, the message will be logged regardless of node (the ‘:mpi:’ will be removed from the message). Otherwise, only node 0 will emit messages.
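A sketch of the prefix convention, assuming logging has been configured via uncoverml.mllog.configure so this handler is installed:

import logging

log = logging.getLogger('uncoverml')
log.info('only node 0 emits this message')
log.info(':mpi:every node emits this message (prefix stripped)')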

emit(record)

Emit a record.

If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.

uncoverml.mllog.configure(verbosity)
uncoverml.mllog.handle_exception(exc_type, exc_value, exc_traceback)

Add MPI index to exception traceback.

uncoverml.mllog.warn_with_traceback(message, category, filename, lineno, line=None)

copied from: http://stackoverflow.com/questions/22373927/get-traceback-of-warnings

uncoverml.models module

Model Objects and ML algorithm serialisation.

This module makes many of the models in scikit learn and revrand available to our pipeline, as well as augmenting their functionality with, for examples, target transformations.

This table is a quick breakdown of the advantages and disadvantages of the various algorithms we can use in this pipeline.

Algorithm                   | Learning Scalability | Modelling Capacity | Prediction Speed | Probabilistic
----------------------------+----------------------+--------------------+------------------+--------------
Bayesian linear regression  | + + +                | +                  | + + + +          | Yes
Approx. Gaussian process    | + +                  | + + + +            | + + + +          | Yes
SGD linear regression       | + + + +              | +                  | + + +            | Yes
SGD Gaussian process        | + + + +              | + + + +            | + + +            | Yes
Support Vector Regression   | +                    | + + + +            | +                | No
Random Forest Regression    | + + +                | + + + +            | + +              | Pseudo
Cubist Regression           | + + +                | + + + +            | + +              | Pseudo
ARD Regression              | + +                  | + +                | + + +            | No
Extremely Randomized Reg.   | + + +                | + + + +            | + +              | No
Decision Tree Regression    | + + +                | + + +              | + + + +          | No

class uncoverml.models.ARDRegressionTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

ARD regression.

http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression

class uncoverml.models.ApproxGP(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, tol=1e-08, maxiter=1000, nstarts=100)

Bases: uncoverml.models.BasisMakerMixin, revrand.slm.StandardLinearModel, uncoverml.models.PredictDistMixin, uncoverml.models.MutualInfoMixin

An approximate Gaussian process for medium scale data.

Parameters
  • kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at basismap dictionary for appropriate kernel approximations.

  • nbases (int) – how many unique random bases to create (twice this number will be actually created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this will have to also be high dimensional.

  • lenscale (float, optional) – the initial value for the kernel length scale to be learned.

  • ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • tol (float, optional) – optimiser function tolerance convergence criterion.

  • maxiter (int, optional) – maximum number of iterations for the optimiser.

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.
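A usage sketch (data illustrative; the return layout of predict_dist is assumed to follow PredictDistMixin, i.e. mean, variance, and lower/upper quantiles):

import numpy as np
from uncoverml.models import ApproxGP

X = np.random.rand(200, 3)
y = np.random.rand(200)
model = ApproxGP(kernel='rbf', nbases=50)
model.fit(X, y)
Ey, Vy, ql, qu = model.predict_dist(X, interval=0.95)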

class uncoverml.models.ApproxGPTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Approximate Gaussian process.

http://nicta.github.io/revrand/slm.html

class uncoverml.models.BasisMakerMixin

Bases: object

Mixin class for easily creating approximate kernel functions for revrand.

This is primarily used for the approximate Gaussian process algorithms.

fit(X, y, *args, **kwargs)
class uncoverml.models.BootstrappedSVR(n_models=100, parallel=True, *args, **kwargs)

Bases: uncoverml.models.bootstrap_model.<locals>.BootstrappedModel, uncoverml.models.TagsMixin

class uncoverml.models.CubistMultiTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Parallel Cubist regression (wrapper).

https://www.rulequest.com/cubist-info.html

class uncoverml.models.CubistTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Cubist regression (wrapper).

https://www.rulequest.com/cubist-info.html

class uncoverml.models.CustomKNeighborsRegressor(n_neighbors=10, weights='distance', algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, min_distance=0.0)

Bases: sklearn.neighbors._regression.KNeighborsRegressor

class uncoverml.models.DecisionTreeTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Decision tree regression.

http://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

class uncoverml.models.ExtraTreeTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Extremely randomised tree regressor.

http://scikit-learn.org/dev/modules/generated/sklearn.tree.ExtraTreeRegressor.html#sklearn.tree.ExtraTreeRegressor

class uncoverml.models.GLMPredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the GeneralisedLinearModel class in revrand.

This is especially for use with Gaussian likelihood models.

predict_dist(X, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • fields (dict, optional) – dictionary of fields parsed from the shapefile. indicator_field should be a key in this dictionary. If it is not present, a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.

Returns

  • Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)

class uncoverml.models.GradBoostedTrees(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Gradient Boosted Trees multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

class uncoverml.models.KNearestNeighborTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

K Nearest Neighbour Regression

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

class uncoverml.models.LinearReg(onescol=True, var=1.0, regulariser=1.0, tol=1e-08, maxiter=1000, nstarts=100)

Bases: revrand.slm.StandardLinearModel, uncoverml.models.PredictDistMixin, uncoverml.models.MutualInfoMixin

Bayesian standard linear model.

Parameters
  • onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • tol (float, optional) – optimiser function tolerance convergence criterion.

  • maxiter (int, optional) – maximum number of iterations for the optimiser.

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

class uncoverml.models.LinearRegTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Bayesian linear regression.

http://nicta.github.io/revrand/slm.html

class uncoverml.models.LogisticClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Logistic Regression for multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

class uncoverml.models.LogisticRBF(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Approximate large scale kernel logistic regression.

class uncoverml.models.MaskRows(*Xs)

Bases: object

apply_mask(X)
apply_masks(*Xs)
static get_complete_rows(X)
trim_mask(X)
trim_masks(*Xs)
class uncoverml.models.MultiRandomForestTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

MPI implementation of random forest regression, with forests grown across many CPUs.

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

class uncoverml.models.MutualInfoMixin

Bases: object

Mixin class for providing predictive entropy reduction functionality to the StandardLinearModel class (only).

entropy_reduction(X)

Predictive entropy reduction (a.k.a. mutual information).

Estimate the reduction in the posterior distribution’s entropy (i.e. model uncertainty reduction) as a result of including a particular observation.

Parameters

X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

Returns

MI – Prediction of mutual information (expected reduction in posterior entropy) associated with each query input. The units are ‘nats’, and the shape of the returned array is (Ns,).

Return type

ndarray
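
For example, a minimal usage sketch (assuming m is a fitted LinearReg instance and X_query is an (Ns, d) query array; both names are illustrative):

>>> mi = m.entropy_reduction(X_query)  # shape (Ns,), units of nats
>>> most_informative = mi.argmax()     # query whose observation would most reduce posterior entropy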

class uncoverml.models.PredictDistMixin

Bases: object

Mixin class for providing a predict_dist method to the StandardLinearModel class in revrand.

predict_dist(X, interval=0.95, *args, **kwargs)

Predictive mean and variance for a probabilistic regressor.

Parameters
  • X (ndarray) – (Ns, d) array query dataset (Ns samples, d dimensions).

  • interval (float, optional) – The percentile confidence interval (e.g. 95%) to return.

  • fields (dict, optional) – dictionary of fields parsed from the shapefile. indicator_field should be a key in this dictionary. If it is not present, a Gaussian likelihood will be used for all predictions. The only time this may be input is for cross validation.

Returns

  • Ey (ndarray) – The expected value of ys for the query inputs, X of shape (Ns,).

  • Vy (ndarray) – The expected variance of ys (excluding likelihood noise terms) for the query inputs, X of shape (Ns,).

  • ql (ndarray) – The lower end point of the interval with shape (Ns,)

  • qu (ndarray) – The upper end point of the interval with shape (Ns,)
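
A minimal usage sketch (assuming m is a fitted model using this mixin and X_query is an (Ns, d) array; both names are illustrative):

>>> Ey, Vy, ql, qu = m.predict_dist(X_query, interval=0.9)
>>> # Ey: predictive means, Vy: predictive variances,
>>> # ql, qu: bounds of the central 90% interval, each of shape (Ns,)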

class uncoverml.models.RandomForestClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Random Forest for multi-class classification.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

class uncoverml.models.RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

Bases: sklearn.ensemble._forest.RandomForestRegressor

Implements a “probabilistic” output by looking at the variance of the decision tree estimator outputs.

predict_dist(X, interval=0.95)
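
The idea can be sketched as follows (an illustrative approximation, not the exact implementation): the predictive mean is the usual forest prediction, the variance is taken across the individual tree predictions, and the interval follows from a Gaussian assumption (with nonzero variance).

import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def predict_dist_sketch(forest: RandomForestRegressor, X, interval=0.95):
    # Per-tree predictions, shape (n_trees, n_samples)
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    Ey = per_tree.mean(axis=0)  # predictive mean
    Vy = per_tree.var(axis=0)   # "probabilistic" output: variance across trees
    # Central `interval` quantiles under a Gaussian assumption
    ql, qu = norm.interval(interval, loc=Ey, scale=np.sqrt(Vy))
    return Ey, Vy, ql, qu
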
class uncoverml.models.RandomForestRegressorMulti(outdir='.', forests=10, parallel=True, n_estimators=10, random_state=1, **kwargs)

Bases: object

fit(x, y, *args, **kwargs)
predict(x)
predict_dist(x, interval=0.95, *args, **kwargs)
class uncoverml.models.RandomForestTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Random forest regression.

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

class uncoverml.models.SGDApproxGP(kernel='rbf', nbases=50, lenscale=1.0, var=1.0, regulariser=1.0, ard=True, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=1, nstarts=500)

Bases: uncoverml.models.BasisMakerMixin, revrand.glm.GeneralisedLinearModel, uncoverml.models.GLMPredictDistMixin

An approximate Gaussian process for large scale data using stochastic gradients.

This uses the Adam stochastic gradients algorithm; see http://arxiv.org/pdf/1412.6980

Parameters
  • kernel (str, optional) – the (approximate) kernel to use with this Gaussian process. Have a look at the basismap dictionary for appropriate kernel approximations.

  • nbases (int) – how many unique random bases to create (twice this number will actually be created, i.e. real and imaginary components for each base). The higher this number, the more accurate the kernel approximation, but the longer the runtime of the algorithm. Usually if X is high dimensional, this will also have to be high dimensional.

  • lenscale (float, optional) – the initial value for the kernel length scale to be learned.

  • ard (bool, optional) – Whether to use a different length scale for each dimension of X or a single length scale. This will result in a longer run time, but potentially better results.

  • var (float, optional) – observation variance initial value.

  • regulariser (float, optional) – weight regulariser (variance) initial value.

  • maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.

  • batch_size (int, optional) – number of observations to use per SGD batch.

  • alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.

  • beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].

  • beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].

  • epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).

  • random_state (int or RandomState, optional) – random seed

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

Note

Setting the random_state may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!

class uncoverml.models.SGDApproxGPTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Approximate Gaussian processes with stochastic gradients.

http://nicta.github.io/revrand/glm.html

class uncoverml.models.SGDLinearReg(onescol=True, var=1.0, regulariser=1.0, maxiter=3000, batch_size=10, alpha=0.01, beta1=0.9, beta2=0.99, epsilon=1e-08, random_state=None, nstarts=500)

Bases: revrand.glm.GeneralisedLinearModel, uncoverml.models.GLMPredictDistMixin

Bayesian standard linear model, using stochastic gradients.

This uses the Adam stochastic gradients algorithm; see http://arxiv.org/pdf/1412.6980

Parameters
  • onescol (bool, optional) – If true, prepend a column of ones onto X (i.e. a bias term)

  • var (Parameter, optional) – observation variance initial value.

  • regulariser (Parameter, optional) – weight regulariser (variance) initial value.

  • maxiter (int, optional) – Number of iterations to run for the stochastic gradients algorithm.

  • batch_size (int, optional) – number of observations to use per SGD batch.

  • alpha (float, optional) – stepsize to give the stochastic gradient optimisation update.

  • beta1 (float, optional) – smoothing/decay rate parameter for the stochastic gradient, must be in [0, 1].

  • beta2 (float, optional) – smoothing/decay rate parameter for the squared stochastic gradient, must be in [0, 1].

  • epsilon (float, optional) – “jitter” term to ensure continued learning in stochastic gradients (should be small).

  • random_state (int or RandomState, optional) – random seed

  • nstarts (int, optional) – if there are any parameters with distributions as initial values, this determines how many random candidate starts should be evaluated before commencing optimisation at the best candidate.

Note

Setting the random_state may be important for getting consistent looking predictions when many chunks/subchunks are used. This is because the predictive distribution is sampled for these algorithms!

class uncoverml.models.SGDLinearRegTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Bayesian linear regression with stochastic gradients.

http://nicta.github.io/revrand/glm.html

class uncoverml.models.SVRTransformed(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

Support vector machine.

http://scikit-learn.org/dev/modules/svm.html#svm

class uncoverml.models.SupportVectorClassifier(*args, **kwargs)

Bases: uncoverml.models.encode_targets.<locals>.EncodedClassifier, uncoverml.models.TagsMixin

Support Vector Machine multi-class classification.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

class uncoverml.models.TagsMixin

Bases: object

Mixin class to aid a pipeline in establishing the types of predictive outputs to be expected from the ML algorithms in this module.

get_predict_tags()

Get the types of prediction outputs from this algorithm.

Returns

of strings with the types of outputs that can be returned by this algorithm. This depends on the prediction methods implemented (e.g. predict, predict_dist, entropy_reduction).

Return type

list

class uncoverml.models.TransformedCTInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedLinearNDInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedNearestNDInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

class uncoverml.models.TransformedRbfInterpolator(target_transform='identity', *args, **kwargs)

Bases: uncoverml.models.transform_targets.<locals>.TransformedRegressor, uncoverml.models.TagsMixin

uncoverml.models.apply_masked(func, data, *args, **kwargs)
uncoverml.models.apply_multiple_masked(func, data, *args, **kwargs)
uncoverml.models.bootstrap_model(model)
uncoverml.models.encode_targets(Classifier)
uncoverml.models.kernelize(classifier)
uncoverml.models.transform_targets(Regressor)

Factory function that adds target transformation capability to compatible scikit-learn objects.

Look at the transformers.py module for more information on valid target transformers.

Example

>>> svr = transform_targets(SVR)(target_transform='Standardise', gamma=0.1)

uncoverml.mpiops module

uncoverml.mpiops.chunk_index = 0

the index (from zero) of this node in the MPI world. Also known as the rank of the node.

Type

int

uncoverml.mpiops.chunks = 1

the total number of nodes in the MPI world

Type

int

uncoverml.mpiops.comm = <mpi4py.MPI.Intracomm object>

module-level MPI ‘world’ object representing all connected nodes

uncoverml.mpiops.count(x)
uncoverml.mpiops.count_targets(targets)
uncoverml.mpiops.covariance(x)
uncoverml.mpiops.create_shared_array(data, root=0, writeable=False)

Create a shared numpy array among MPI nodes. To access the data, refer to the returned numpy array ‘shared’. The second return value is the MPI window; this doesn’t need to be interacted with except when deallocating the memory.

When finished with the data, set shared = None and call win.Free().

Caution: any node with a handle on the shared array can modify its contents. To be safe, the shared array is set to read-only by default.

Parameters
  • data (numpy.ndarray) – The numpy array to share.

  • root (int) – Rank of the root node that contains the original data.

  • writeable (bool) – Whether or not the resulting shared array is writeable.

Returns

Return type

tuple of numpy.ndarray, MPI window
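
A minimal usage sketch (the array contents are illustrative; only the root node’s data is shared):

import numpy as np
from uncoverml import mpiops

data = np.arange(12, dtype=float)           # original data on the root node
shared, win = mpiops.create_shared_array(data, root=0)
total = shared.sum()                        # any node may read the shared array
# When finished: drop all references to the array, then free the window
shared = None
win.Free()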

uncoverml.mpiops.eigen_decomposition(x)
uncoverml.mpiops.max_axis_0(x, y, dtype)
uncoverml.mpiops.mean(x)
uncoverml.mpiops.min_axis_0(x, y, dtype)
uncoverml.mpiops.minimum(x)
uncoverml.mpiops.outer(x)
uncoverml.mpiops.outer_count(x)
uncoverml.mpiops.power(x, exp)
uncoverml.mpiops.random_full_points(x, Napprox)
uncoverml.mpiops.run_once(f, *args, **kwargs)

Run a function on one node and broadcast the result to all.

This function evaluates a function on a single node in the MPI world, then broadcasts the result of that function to every node in the world.

Parameters
  • f (callable) – The function to be evaluated. Can take arbitrary arguments and return anything or nothing.

  • args (optional) – Other positional arguments to pass on to f

  • kwargs (optional) – Other named arguments to pass on to f

Returns

The value returned by f

Return type

result
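
For example (a sketch; load_settings and the file path are illustrative):

import json
from uncoverml import mpiops

def load_settings(path):
    # Something worth doing only once, e.g. reading a file from disk
    with open(path) as f:
        return json.load(f)

# Evaluated on a single node; every node receives the parsed dict
settings = mpiops.run_once(load_settings, 'settings.json')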

uncoverml.mpiops.sd(x)
uncoverml.mpiops.sum_axis_0(x, y, dtype)
uncoverml.mpiops.unique(sets1, sets2, dtype)

uncoverml.patch module

Image patch extraction and windowing utilities.

uncoverml.patch.all_patches(image, patchsize)
uncoverml.patch.grid_patches(image, pwidth)

Generate (overlapping) patches from an image. This function extracts square patches from an image in an overlapping, dense grid.

Parameters
  • image (ndarray) – an array of shape (x, y) or (x, y, channels).

  • pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.

Returns

patch – An image of shape (x, y, channels*psize*psize), where psize = pwidth * 2 + 1

Return type

ndarray
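
For example, with pwidth = 1 every output pixel carries its full 3x3 neighbourhood (a sketch; the image shape is illustrative):

import numpy as np
from uncoverml import patch

image = np.random.rand(100, 80, 4)             # (x, y, channels)
patched = patch.grid_patches(image, pwidth=1)
# psize = 2 * 1 + 1 = 3, so per the return description above the
# expected shape is (100, 80, 4 * 3 * 3)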

uncoverml.patch.patches_at_target(image, patchsize, targets)
uncoverml.patch.point_patches(image, pwidth, points)

Extract patches from an image at specified points.

Parameters
  • image (ndarray) – an array of shape (x, y, channels).

  • pwidth (int) – the half-width of the square patches to extract, in pixels. E.g. pwidth = 0 gives a 1x1 patch, pwidth = 1 gives a 3x3 patch, pwidth = 2 gives a 5x5 patch etc. The formula for calculating the full patch width is pwidth * 2 + 1.

  • points (ndarray) – of shape (N, 2) where there are N points, each with an x and y coordinate of the patch centre within the image.

Returns

patches – An image patch array of shape (N, psize, psize, channels), where psize = pwidth * 2 + 1

Return type

ndarray

uncoverml.predict module

uncoverml.predict.cluster_analysis(x, y, partition_no, config, feature_names)
Parameters
  • x (ndarray) – array of dim (Ns, d)

  • y (ndarray) – array of predictions of dimension (Ns, 1)

  • partition_no (int) – partition number of the image

  • config (config object) –

  • feature_names (list) – list of strings corresponding to ordered feature names

uncoverml.predict.div0(a, b)

Division that ignores divide-by-zero: div0([-1, 0, 1], 0) -> [0, 0, 0]
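
A minimal sketch of these semantics with numpy (illustrative, not necessarily the exact implementation):

import numpy as np

def div0_sketch(a, b):
    # Elementwise a / b on array-like inputs, mapping x / 0 to 0
    # instead of inf or nan
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[~np.isfinite(c)] = 0.0   # -inf, inf, nan -> 0
    return c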

uncoverml.predict.final_cluster_analysis(n_classes, n_paritions)
uncoverml.predict.predict(data, model, interval=0.95, **kwargs)
uncoverml.predict.render_partition(model, subchunk, image_out, config)
uncoverml.predict.shapefile_prediction(config, model)
uncoverml.predict.write_mean_and_sd(x, y, writer, config)

uncoverml.resampling module

Module for shapefile resampling methods. This code was originally developed by Sudipta Basak (https://github.com/basaks).

See uncoverml.scripts.shiftmap_cli for a resampling CLI.

uncoverml.resampling.bootstrap_data_indicies(population, samples=None, random_state=1)
uncoverml.resampling.filter_fields(fields_to_keep, gdf)
uncoverml.resampling.prepapre_dataframe(data, fields_to_keep)
uncoverml.resampling.resample_by_magnitude(input_data, target_field, bins=10, interval='percentile', fields_to_keep=[], bootstrap=True, output_samples=None, validation=False, validation_points=100)
Parameters
  • input_data (geopandas.GeoDataFrame) – Geopandas dataframe containing targets to be resampled.

  • target_field (str) – target field name based on which resampling is performed. Field must exist in the input data

  • bins (int) – number of bins for sampling

  • fields_to_keep (list) – of strings to store in the output shapefile

  • bootstrap (bool, optional) – whether to sample with replacement or not

  • output_samples (int, optional) – number of samples in the output shapefile. If not provided, the number of output samples will be the same as in the original shapefile

  • validation (bool, optional) – whether to also produce a validation shapefile

  • validation_points (int, optional) – approximate number of points in the validation shapefile

uncoverml.resampling.resample_spatially(input_data, target_field, rows=10, cols=10, fields_to_keep=[], bootstrap=True, output_samples=None, validation_points=100)
Parameters
  • input_data (geopandas.GeoDataFrame) – Geopandas dataframe containing targets to be resampled.

  • target_field (str) – target field name based on which resampling is performed. Field must exist in the input data

  • rows (int, optional) – number of bins in y

  • cols (int, optional) – number of bins in x

  • fields_to_keep (list) – of strings to store in the output shapefile

  • bootstrap (bool, optional) – whether to sample with replacement or not

  • output_samples (int, optional) – number of samples in the output shapefile. If not provided, the number of output samples will be the same as in the original shapefile

  • validation_points (int, optional) – approximate number of points in the validation shapefile

Returns

Return type

output_shapefile name

uncoverml.targets module

class uncoverml.targets.Targets(lonlat, vals, othervals=None)

Bases: object

classmethod from_geodataframe(gdf, observations_field='observations')

Returns a Targets object from a geopandas dataframe. One column will be taken as the main ‘observations’ field. All remaining non-geometry columns will be stored in the fields property.

Parameters

observations_field (str) – Name of the column in the dataframe that is the main target observation (the field to train on).

Returns

Return type

Targets

to_geodataframe()

Returns a copy of the targets as a geopandas dataframe.

Returns

Return type

geopandas.GeoDataFrame

uncoverml.targets.gather_targets(targets, keep, node=None)
uncoverml.targets.gather_targets_main(targets, keep, node)
uncoverml.targets.generate_covariate_shift_targets(targets, bounds, n_points)
uncoverml.targets.generate_dummy_targets(bounds, label, n_points, field_keys=[], seed=1)

Generate dummy points with randomly generated positions. Points are generated on node 0 and distributed to other nodes if running in parallel.

Parameters
  • bounds (tuple of float) – Bounding box to generate targets within, of format (xmin, ymin, xmax, ymax).

  • label (str) – Label to assign generated targets.

  • n_points (int) – Number of points to generate

  • field_keys (list of str, optional) – List of keys to add to fields property.

  • seed (int, optional) – Random number generator seed.

Returns

A collection of randomly generated targets.

Return type

Targets
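
For example (a sketch; the bounding box values are illustrative):

from uncoverml import targets

dummies = targets.generate_dummy_targets(
    bounds=(120.0, -35.0, 125.0, -30.0),   # (xmin, ymin, xmax, ymax)
    label='dummy',
    n_points=1000,
)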

uncoverml.targets.label_targets(targets, label, backup_field=None)

Replaces target observations (the target property being trained on) with the given label.

Parameters
  • targets (Targets) – A collection of targets to label.

  • label (str) – The label to apply.

  • backup_field (str) – If present, copies the original observation data to the fields property with the provided string as the key.

Returns

The labelled targets.

Return type

Targets

uncoverml.targets.merge_targets(a, b)

Merges two Targets objects. They will be sorted in the canonical uncover-ml way: lexically by position (y, x).

Parameters
  • a (Targets) – The first Targets object to merge.

  • b (Targets) – The second Targets object to merge.

Returns

A single merged collection of targets.

Return type

Targets
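
The canonical ordering can be sketched with numpy’s lexsort (illustrative only):

import numpy as np

def canonical_order(lonlat):
    # Sort lexically by position (y, x): np.lexsort treats the *last*
    # key as primary, so pass x first and y second
    return np.lexsort((lonlat[:, 0], lonlat[:, 1]))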

uncoverml.targets.save_dropped_targets(config, keep, targets)
uncoverml.targets.save_targets(targets, path, obs_filter=None)

Saves target positions and observation data to a CSV file.

Parameters
  • targets (Targets) – The targets to save.

  • path (str) – Path to file to save as.

  • obs_filter (any, optional) – If provided, will only save points that have this observation data.

uncoverml.validate module

Scripts for validation

class uncoverml.validate.CrossvalInfo(scores, y_true, y_pred, classification, positions)

Bases: object

export_crossval(config)

Exports a CSV file containing real target values and their corresponding predicted value generated as part of cross-validation.

Also populates the ‘prediction’ column of the ‘rawcovariates’ CSV file.

If enabled, the real vs predicted values will be plotted.

Parameters

config (Config) – Uncover-ml config object.

class uncoverml.validate.OOSInfo(scores, y_true, y_pred, classification, positions)

Bases: uncoverml.validate.CrossvalInfo

export_scores(config)
uncoverml.validate.adjusted_r2_score(r2, n_samples, n_covariates)
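
This presumably applies the conventional adjustment, penalising R² for the number of covariates (a sketch under that assumption):

def adjusted_r2_sketch(r2, n_samples, n_covariates):
    # Standard adjusted R^2: corrects the score for model complexity
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_covariates - 1)
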
uncoverml.validate.classification_validation_scores(ys, eys, pys)

Calculates the validation scores for a classification prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:

  • accuracy

  • log_loss

  • f1

Parameters
  • ys (numpy.array) – The test data outputs, one-hot representation

  • eys (numpy.array) – The (hard) predictions made by the trained model on test data, one-hot representation

  • pys (numpy.array) – The probabilistic predictions made by the trained model on test data

Returns

scores – A dictionary containing all of the evaluated scores.

Return type

dict

uncoverml.validate.local_crossval(x_all, targets_all, config)

Performs K-fold cross validation to test the applicability of a model. Given a set of inputs and outputs, this function evaluates the effectiveness of a model at predicting the targets by splitting all of the known data. A model is trained on a subset of the total data and then used to predict all of the unseen targets; its performance provides a benchmark for evaluating the effectiveness of the model.

Parameters
  • x_all (numpy.array) – A 2D array containing all of the training inputs

  • targets_all (numpy.array) – A 1D vector containing all of the training outputs

  • config (Config) – The global config object, which is used to choose the model to train.

Returns

result – A dictionary containing all of the cross validation metrics, evaluated on the unseen data subset.

Return type

dict

uncoverml.validate.local_rank_features(image_chunk_sets, transform_sets, targets, config)

Ranks the importance of the features based on their performance. This function trains and cross-validates a model with each individual feature removed, and then measures the resulting performance. The most important feature is the one which, when removed, causes the greatest degradation in the performance of the model.

Parameters
  • image_chunk_sets (dict) – A dictionary used to get the set of images to test on.

  • transform_sets (list) – A list containing the applied transformations

  • targets (instance of geoio.Targets class) – The targets used in the cross validation

  • config (config class instance) – The global config file

uncoverml.validate.out_of_sample_validation(model, targets, features, config)
uncoverml.validate.permutation_importance(model, x_all, targets_all, config)
uncoverml.validate.regression_validation_scores(y, ey, n_covariates, model)

Calculates the validation scores for a regression prediction. Given the test and training data, as well as the outputs from every model, this function calculates all of the applicable metrics in the following list, and returns a dictionary with the following (possible) keys:

  • r2_score

  • expvar

  • smse

  • lins_ccc

  • mll

Parameters
  • y (numpy.array) – The test data outputs

  • ey (numpy.array) – The predictions made by the trained model on test data

  • n_covariates (int) – The number of covariates being used.

Returns

scores – A dictionary containing all of the evaluated scores.

Return type

dict

uncoverml.validate.split_cfold(nsamples, k=5, seed=None)

Function that returns indices for splitting data into random folds.

Parameters
  • nsamples (int) – the number of samples in the dataset

  • k (int, optional) – the number of folds

  • seed (int, optional) – random seed to provide to numpy

Returns

  • cvinds (list) – list of k arrays of indices, each of approximate length nsamples / k. Together these arrays form a random permutation (without replacement) of the sample indices, assigning each sample to exactly one fold.

  • cvassigns (ndarray) – array of shape (nsamples,) with each element in [0, k), that can be used to assign data to a fold. This corresponds to the indices of cvinds.
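
For example (a sketch):

from uncoverml.validate import split_cfold

cvinds, cvassigns = split_cfold(nsamples=100, k=5, seed=1)
fold0_test = cvinds[0]          # indices held out for fold 0
train_mask = cvassigns != 0     # boolean mask selecting fold-0 training data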

Module contents