.. _workflows:

Workflows
=========

UncoverML workflows are controlled by a YAML configuration file. This
section provides examples and explanations of the different workflows,
the required configuration, the uncoverml commands and the produced
outputs. There are some example YAML files in the *uncover-ml/configs*
directory to help you get started.

.. _learn_and_predict:

Learn and Predict
-----------------

The most common task in UncoverML is using it to train a model and then
perform a prediction. This requires:

- A config file with the necessary parameters
- A set of geotiffs containing covariate/feature data
- A shapefile containing training targets

Config
~~~~~~

An example of a learn/predict config:

.. code:: yaml

    learning:
      algorithm: transformedrandomforest
      arguments:
        n_estimators: 10
        target_transform: log

    features:
      - type: ordinal
        files:
          - directory: path/to/covariates/directory
        transforms:
          - centre
          - standardise
        imputation: mean

    targets:
      file: path/to/training/target/shapefile
      property: Na_log

    prediction:
      quantiles: 0.95
      outbands: 4

    output:
      directory: path/to/store/output

The ``learning`` block specifies the algorithm/model to train:

- ``algorithm``: the name of the algorithm.
- ``arguments``: a dictionary of keyword arguments specific to that
  algorithm. See the *models page* for available algorithms and links
  to their documentation to understand what parameters are available
  for each algorithm.
- ``target_transform``: this argument is specific to UncoverML and an
  addition to the algorithm's base arguments. Target values will be
  transformed using this method before being passed to the algorithm's
  ``fit`` and ``predict`` methods.

  - ``identity``: identity transform - does nothing to the data. This
    is the default option if ``target_transform`` is not provided.
  - ``standardise``: applies a standardisation transform.
  - ``sqrt``: applies a square root transform.
  - ``log``: applies a log10 transform.
  - ``logistic``: applies a logistic transform.
  - ``rank``: applies a RankGauss transform.
  - ``kde``: applies a KDE transform.

The ``features`` block contains the features/covariates to be used for
training and prediction:

- ``type``: UncoverML supports ordinal and categorical data, but they
  must be provided in separate sets and the ``type`` must be provided
  (if no type is provided, UncoverML will assume the data is ordinal).
- ``files``: the location of the geotiffs. Multiple entries can be
  provided for ``files`` and can be of type ``directory``, ``path`` or
  ``list`` (the entry types can be mixed; see the example after this
  list).

  - ``directory``: path to a directory. All geotiffs within this
    directory will be loaded as features.
  - ``path``: path to a geotiff file. This will be loaded as a feature.
  - ``list``: path to a text file. Each line of the text file is a path
    to a geotiff, and each listed geotiff will be loaded as a feature.
- ``transforms``: transforms to use for scaling features. Multiple
  transforms can be applied to each set of features specified.

  - ``centre``: applies a centre transform (demeans the data).
  - ``standardise``: applies a standardisation transform.
  - ``log``: applies a log10 transform.
  - ``sqrt``: applies a square root transform.
  - ``whiten``: applies a Whiten transform.
  - ``onehot``: performs OneHot encoding.
  - ``randomhot``: performs RandomHot encoding.
- ``imputation``: the imputer to use for filling missing values in the
  provided features.

  - ``mean``: mean imputation.
  - ``gauss``: Gauss imputation.
  - ``nn``: nearest neighbour imputation.
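For instance, the three kinds of ``files`` entries can be combined
within a single feature set. A sketch (all paths are placeholders):

.. code:: yaml

    features:
      - type: ordinal
        files:
          - directory: path/to/covariates/directory
          - path: path/to/single/covariate.tif
          - list: path/to/feature_list.txt
        transforms:
          - standardise
        imputation: mean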
*Geotiffs must be the same size and resolution, otherwise an exception
will be raised.*

Multiple feature sets can be specified for different types of data, or
for applying different transforms:

.. code:: yaml

    features:
      - type: categorical
        files:
          - path: path/to/categorical/feature
        transforms:
          - onehot
      - type: ordinal
        files:
          - list: path/to/feature/list.txt
        transforms:
          - standardise
        imputation: nn

The ``targets`` block specifies the training targets to be used:

- ``file``: path to the shapefile containing training targets.
- ``property``: name of the field in the shapefile to be used as the
  training value. If not provided, the first field of the shapefile
  will be used.

*At least some of the targets in the shapefile must intersect with the
provided features. Targets outside of the feature area will be
ignored.*

The ``prediction`` block specifies what outbands to generate and the
quantile interval for producing upper and lower quantile maps:

- ``quantiles``: a value between 0 and 1 representing the prediction
  interval, e.g. '0.95' means the predicted values will fall within the
  lower and upper quantiles 95% of the time.
- ``outbands``: each band will be written as a separate geotiff file.
  For classification, the available outbands are equivalent to the
  available classes. For regression, the first outband is the
  prediction and, if the model provides them, the next are variance,
  lower quantile and upper quantile. Some specific models provide
  further options - refer to the documentation for the specific model
  you are using. The ``outbands`` number is used as the RHS of a slice,
  so providing '1' for a regression will output prediction (0) and
  variance (1). See the sketch after this list.
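For example, for a regression model that provides variance and
quantiles, the following sketch requests all four bands:

.. code:: yaml

    prediction:
      quantiles: 0.95   # 95% prediction interval for the quantile bands
      outbands: 4       # prediction, variance, lower quantile, upper quantile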
The ``output`` block directs UncoverML where to store learning and
prediction outputs:

- ``directory``: path to a directory to store the outputs. Will be
  created if it does not exist. All outputs are prefixed with the
  config file name.

Running
~~~~~~~

To run this workflow, first train the model:

.. code:: bash

    uncoverml learn config.yaml

Once the command is complete, outputs will exist in the directory
specified in the ``output`` block. For learning, these are:

- ``rawcovariates.csv``: a table containing the value from each
  covariate and the target value for a given point. This is before any
  scaling or transforms are applied.
- ``rawcovariates_mask.csv``: a table representing the mask for each
  point, i.e. which covariates are missing from which coordinates.
- ``transformed_targets.csv``: a table containing untransformed target
  values in the first column and the transformed values in the second
  column.
- ``*.model``: a Python binary file containing the trained model. It is
  not accessed directly but is used as part of the prediction step.

Predictions can then be performed across the entire feature space:

.. code:: bash

    uncoverml predict config.yaml

Upon completion, prediction maps will be written to the output
directory. There will be one map for each band requested as output in
the configuration file. The map will be named after the band, e.g.
``prediction.tif`` for the main prediction band. Downsampled versions
of the maps will also be generated for use as thumbnails.

Shapefile Predictions
---------------------

Besides images, feature data can also be extracted from shapefiles. If
this method is used, then the prediction will be written out as a
shapefile.

Config
~~~~~~

To use shapefiles, the ``features`` block of the config must be
modified:

.. code:: yaml

    features:
      - type: ordinal
        shapefile:
          file: /path/to/shapefile
          fields:
            - field1
            - field2
            - field3
          ndv: None
        transforms:
          - standardise
        imputation: mean

It's similar to the structure for using geotiffs, with some changes:

- ``shapefile``: this starts the shapefile block and signifies that
  shapefile training and prediction is being used. *Note that currently
  it is not possible to mix shapefile and image features. An error will
  occur if you attempt to do so.*
- ``file``: path to the shapefile containing features. The file you
  provide must have the same number of records as your target
  shapefile, and must have the same geometries.
- ``fields``: a list of fields in the shapefile that will be used as
  covariates.
- ``ndv``: the no data value for the shapefile. Samples with this value
  will be masked out when training and predicting.

There's also the addition of the ``drop`` parameter to the ``targets``
block:

.. code:: yaml

    targets:
      file: ...
      property: ...
      drop: ['unknown', 'undefined']

- ``drop``: a list of values to drop from the targets. Rows where the
  ``property`` field contains one of these values will be dropped when
  intersecting features and targets. This is convenient when training
  on files containing missing data that you later want to predict on:
  you can drop the rows containing the missing data for training, and
  then predict on the full shapefile.

.. note:: Currently point intersection between two shapefiles isn't
   implemented. You need to ensure that your target shapefile (after
   any rows have been dropped) is a 1-to-1 positional match with your
   covariate shapefile. Both files will be ordered by position, and
   it's assumed they will have the same number of rows and that each
   row in one file corresponds to a row in the other.

Running
~~~~~~~

Employ the same ``learn`` and ``predict`` commands as you would for
learning and prediction in the :ref:`learn and predict
<learn_and_predict>` section. The output will be the same as with
image-based predictions, but instead of geotiffs, predictions will be
written to a shapefile:

- ``..._prediction.*``: shapefile containing predicted values.

Model Validation
----------------

Validation can be performed during the training process to generate
model performance metrics. UncoverML supports k-fold cross validation,
out-of-sample validation, feature ranking and permutation importance.

Config
~~~~~~

To perform validation, the ``validation`` block must be provided in
your config file. The usual parameters for learning must also be
provided (see :ref:`learn_and_predict`).

.. code:: yaml

    validation:
      feature_rank: True
      permutation_importance: True
      out_of_sample:
        percentage: 0.2
        shapefile: /path/to/shapefile
        property: name_of_field
      k-fold:
        parallel: False
        folds: 5
        random_seed: 1

- ``feature_rank``: a boolean that turns feature ranking on or off.
- ``permutation_importance``: a boolean that turns permutation
  importance on or off.
- ``out_of_sample``: out-of-sample validation parameters (a sketch of
  out-of-sample-only validation follows this list).

  - ``percentage``: float between 0 and 1, the percentage of targets to
    withhold from the training data to be used for post-learning
    validation.
  - ``shapefile``: path to a shapefile containing targets to be used in
    out-of-sample validation.
  - ``property``: name of the field in the shapefile to use as the
    training value. If not provided, the first field of the shapefile
    will be used.

  .. note:: ``percentage`` has priority over ``shapefile`` - if both
     are provided then the shapefile argument will be ignored.

- ``k-fold``: k-fold cross validation parameters.

  - ``parallel``: a boolean that specifies whether folds are trained
    and predicted in parallel.
  - ``folds``: the number of folds to split the training data into.
  - ``random_seed``: an integer used as the seed for the random number
    generator that splits folds. Using the same seed will produce
    deterministic fold splits.
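Each validation method can also be configured on its own. For example,
a sketch that only performs out-of-sample validation against a
dedicated shapefile (path and field name are placeholders):

.. code:: yaml

    validation:
      out_of_sample:
        # no `percentage`: validate against the supplied shapefile only
        shapefile: /path/to/oos_targets.shp
        property: name_of_field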
Running
~~~~~~~

Model validation is performed as part of the training process. To
perform model validation, run:

.. code:: bash

    uncoverml learn config.yaml

Once complete, validation results will be in the output directory:

- ``crossval_results.csv``: a table of target values and corresponding
  predicted values for k-fold.
- ``crossval_scores.json``: a dictionary containing the value of each
  metric for k-fold.
- ``featureranks.json``: dictionaries containing feature rank results.
  The ``ranks`` dictionary presents the most important feature by its
  impact on each metric (which feature caused the most degradation of
  the metric when excluded from training). The ``scores`` dictionary
  contains the corresponding value for each rank.
- ``permutation_importance.csv``: the permutation importance results.
- ``oos_results.csv``: a table of target values and corresponding
  predicted values for out-of-sample validation.
- ``oos_scores.json``: a dictionary containing the value of each metric
  for out-of-sample validation.
- ``oos_targets.shp``: if ``percentage`` was provided as part of the
  out-of-sample arguments, the withheld targets will be saved to this
  shapefile for reuse.

In addition to these, the 'prediction' column of ``rawcovariates.csv``
will be filled with the corresponding prediction value generated during
cross-validation.

Diagnostics
-----------

UncoverML can generate diagnostic plots. These are controlled by
additional parameters in the ``output`` block:

.. code:: yaml

    output:
      directory: path/to/output/directory
      plot_feature_ranks: True
      plot_intersection: True
      plot_real_vs_pred: True
      plot_correlation: True
      plot_target_scaling: True

Covariate Correlation
~~~~~~~~~~~~~~~~~~~~~

This generates a correlation matrix showing the correlation between the
provided covariates. The color/value of each square shows how strong
the positive or negative correlation is:

.. image:: sirsam_Na_randomforest_correlation.png
   :alt: Covariate correlation matrix.

To generate a correlation matrix, the ``learn`` command must be run
with ``plot_correlation`` set to ``True`` in the configuration.

Feature Ranking
~~~~~~~~~~~~~~~

The result of feature ranking can be plotted. This creates two plots.
One is a grouped bar chart - each colored bar represents a metric, the
Y-axis is the score of that metric and each group is a feature (AKA
covariate):

.. image:: sirsam_Na_randomforest_featureranks.png

The other plot is a series of curves. Each curve represents a metric,
with scores on the Y-axis and covariates on the X-axis. Each inflection
represents what happens to the score when the corresponding covariate
is removed from the model:

.. image:: fr_curve.png

To generate feature rank plots, the ``learn`` command must be run with
``feature_rank`` set to ``True`` under the ``validation`` block of the
config, and ``plot_feature_ranks`` set to ``True``.
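That is, the two flags live in different blocks of the same config:

.. code:: yaml

    validation:
      feature_rank: True

    output:
      directory: path/to/output/directory
      plot_feature_ranks: True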
Covariate Intersection
~~~~~~~~~~~~~~~~~~~~~~

Covariate intersection can be plotted. These scatter plots, one for
each covariate, show the corresponding covariate value for each target
value:

.. image:: intersection.png

To generate intersection plots, the ``learn`` command must be run and
``plot_intersection`` must be ``True`` in the configuration.

Real vs. Prediction
~~~~~~~~~~~~~~~~~~~

A scatter plot showing the intersection between real vs. predicted
values can be generated. This compares the values predicted by the
model to the actual target values, and is important for validating the
model. It includes several pieces of information. The points show the
intersecting values. The 2D histogram in the background shows the
grouping of the points (the colorbar on the right-hand side codes the
frequency of each grouping). The text in the upper-left shows the
scores from the metrics generated by cross-validation:

.. image:: sirsam_Na_randomforest_real_vs_pred.png

A histogram of residual error is also generated. This shows bins
grouping the residual error between intersected points (i.e. the
difference between a predicted value and the corresponding real value):

.. image:: sirsam_Na_randomforest_residual.png

To generate a real vs. prediction plot, ``k-fold`` must be enabled
under the ``validation`` block of the config, and ``plot_real_vs_pred``
must be set to ``True``.

.. _Multiprocessing and Partitioning:

Multiprocessing and Partitioning
--------------------------------

UncoverML supports parallelization using MPI and also supports
partitioning for large datasets. When and how to use these depends on
the task being performed.

Learning
~~~~~~~~

When running ``learn``, it can be beneficial to use multiple processors
to train folds for k-fold cross validation in parallel. For example, if
you have enabled crossval with 5 folds and set ``parallel: True`` in
the ``k-fold`` block, then you can train the folds in parallel using:

.. code:: bash

    mpirun -n 5 uncoverml learn config.yaml

This will speed up the cross-validation process.
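The ``k-fold`` fragment matching this example would be:

.. code:: yaml

    validation:
      k-fold:
        parallel: True   # must be True for folds to train concurrently
        folds: 5
        random_seed: 1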
If not utilising parallel crossval, there is no benefit to using
multiple processors. A model can only be trained on a single processor,
so to avoid wasting compute resources, train models with 1 processor:

.. code:: bash

    uncoverml learn config.yaml
    # equivalent to `mpirun -n 1 uncoverml learn config.yaml`

An exception is when learning an ensemble model. Examples are
``bootstrapsvr`` and ``multirandomforest``. These are ensembles of
several individual models. When training an ensemble model, it can be
beneficial to run with multiple processors, as each processor can train
a submodel. Note that running with more processors than there are
submodels to train will waste compute resources, as the remaining
(n_processors - n_models) processors will have no work to perform. For
example, if you were training a ``bootstrapsvr`` model with 100 models,
and you were running on a single HPC node with 48 CPUs, you can run in
parallel using:

.. code:: bash

    mpirun -n 48 uncoverml learn config.yaml

This will greatly reduce the training time of an ensemble model.

Prediction
~~~~~~~~~~

Prediction gains the most benefit from multiprocessing and
partitioning. Because the model is already trained, prediction can be
done by splitting up the feature space into chunks and distributing
them to processors to speed up prediction. These chunks can then also
be further partitioned and loaded sequentially on each processor to
reduce memory usage. While multiple CPUs will speed up prediction,
partitioning will have an impact on speed, as it increases I/O
overhead. However, it allows prediction on larger-than-memory datasets.
For example:

.. code:: bash

    mpirun -n 48 uncoverml predict config.yaml -p 5

This command will split the feature data into 48 even chunks and
distribute them amongst the CPUs. Each chunk will then be split into 5
partitions (the ``-p 5`` parameter) and loaded, predicted on and
written out sequentially, reducing memory usage.

.. _optimisation-section:

Optimisation
------------

UncoverML provides a framework for using scikit-learn's `GridSearchCV
<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_
to perform hyperparameter tuning. This allows cross-validation to be
performed with different combinations of parameters.

.. note:: Only some models are compatible with optimisation. This is
   because models must be structured in a way compatible with
   scikit-learn's GridSearchCV. This involves:

   - having all arguments explicitly listed in the ``__init__``
     signature (no varargs)
   - having the expected functions (``fit``, ``predict``, etc.)
   - implementing the ``get_params`` and ``set_params`` functions
     defined by `Base Estimator
     <https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html>`_

Optimisable models are listed in the :ref:`models section`.

Config
~~~~~~

An example config for performing optimisation on
``transformedrandomforest``:

.. code:: yaml

    optimisation:
      algorithm: transformedrandomforest
      scorers: [r2, expvar, smse, lins_ccc]
      hyperparameters:
        target_transform: [identity, standardise, log, sqrt]
        n_estimators: [10, 20, 50]
        max_features: [auto, sqrt, log2]
        min_samples_split: [2, 5, 10]
        min_samples_leaf: [1, 5, 10]
        max_depth: [10, 20, 100]

    output:
      directory: path/to/output/directory

- ``algorithm``: name of the algorithm to optimise.
- ``scorers``: a list of metrics to consider when ranking parameter
  combinations.

  - ``r2``, ``expvar``, ``smse`` and ``lins_ccc`` are the available
    regression scorers.
  - ``accuracy``, ``log_loss`` and ``auc`` are applicable to
    classifiers.
- ``hyperparameters``: contains lists of values for various algorithm
  parameters - view the documentation for the algorithm to know which
  parameters are available. Every permutation of the given parameters
  will be scored using cross-validation.

Running
~~~~~~~

Optimisation is run using:

.. code:: bash

    uncoverml gridsearch config.yaml -n 10

The ``-n 10`` is the number of parallel workers to use. Gridsearch uses
the joblib backend, so don't use ``mpirun`` to execute it; rather, set
``-n`` to the number of processors you have available. The output will
be in the output directory:

- ``optimisation.csv``: a table of gridsearch results. For details, see
  the documentation for the ``cv_results_`` attribute in the `sklearn
  docs
  <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_.

Shiftmap
--------

UncoverML has a ``shiftmap`` feature that generates a map of the
covariate shift in your feature space. 'Covariate shift' is the
shifting of covariate distributions across different areas.

The idea behind shiftmap is to generate 'dummy' targets, equal in
number to the provided training targets. These dummy targets are
distributed randomly across the feature space. The two sets of targets
(dummy and real) are labeled 'query' and 'training' respectively. A
logistic classifier is then trained on these targets and classification
maps are generated. Areas where the classifier can't distinguish
between the two classes (e.g. 0.4 - 0.6 certainty for one or the other
class) mean the covariate data in that area has a similar distribution
to the provided training data. In theory, these are areas where the
model can be extrapolated and will perform well, as the distribution is
similar to the provided training data. Areas with a strong covariate
shift (e.g. 0 - 0.4 or 0.6 - 1 certainty for either class) may not be
suitable for extrapolating the model, due to differences in
distribution between the training data and the feature data.
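There is no ``shiftmap``-specific block; a standard learning config is
reused. A minimal sketch (the algorithm choice and paths are
placeholders):

.. code:: yaml

    # standard learning config, reused as-is by `uncoverml shiftmap`
    learning:
      algorithm: transformedrandomforest
      arguments:
        n_estimators: 10

    features:
      - type: ordinal
        files:
          - directory: path/to/covariates/directory

    targets:
      file: path/to/training/target/shapefile
      property: field_name

    output:
      directory: path/to/store/output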
Running
~~~~~~~

``Shiftmap`` requires a standard learning config and doesn't require
extra parameters. To run:

.. code:: bash

    uncoverml shiftmap config.yaml

Covariate shiftmaps will be written to the output directory:

- ``shiftmap_generated_points.csv``: a table of the randomly generated
  points, which can be used for validation and debugging.
- ``shiftmap_query_0.tif``: a map showing the likelihood of each pixel
  belonging to the 'query' class. Areas of uncertainty (0.4 - 0.6 for
  example) are areas with similar distributions to the training data.
- ``shiftmap_training_1.tif``: same as above but inverted; shows the
  likelihood of each pixel belonging to the 'training' class.
- ``most_likely.tif``: a map showing which class each pixel is most
  likely to belong to. This can be ignored as it does not help
  demonstrate the covariate shift, but is left for debugging purposes.

Target Search
-------------

UncoverML implements a novel ``targetsearch`` command. Targetsearch is
used to help find subsets of training points that are similar to a
particular study area. For example, you might have a national dataset
of geochem points and you want to predict the potassium content for a
much smaller study area. Instead of supplying the entire geochem
dataset, you can use target search to determine which points in the
dataset are most similar to the study area (i.e. which points have a
similar distribution of feature data), and these points are then used
in addition to the provided training targets for training the model.

Config
~~~~~~

Target searching is performed by adding parameters to the ``learning``
block of the config:

.. code:: yaml

    learning:
      target_search: True
      target_search_threshold: 0.8
      target_search_extents:
        xmin: 120
        ymin: -20
        xmax: 121
        ymax: -21
      algorithm: transformedrandomforest
      arguments:
        n_estimators: 10
        target_transform: log

- ``target_search``: a boolean of whether or not to use targetsearch.
- ``target_search_threshold``: the likelihood threshold a training
  point must surpass to be included in the found points (i.e. how
  similar a training point's feature distribution is to the study
  area).
- ``target_search_extents``: extents defining the study area. Target
  search will select points from the training targets that have a
  similar feature distribution to this study area.

Running
~~~~~~~

To select and return the targets, first run ``targetsearch``:

.. code:: bash

    uncoverml targetsearch config.yaml

In the output directory, there are three files:

- ``targetsearch_generated_points.csv``: a list of points that were
  randomly generated in the study area, used for validation and
  debugging.
- ``targetsearch_likelihood``: the likelihood of each training target
  belonging to the study area.
- ``targetsearch_result.pk``: a Python binary file containing the
  training targets that have surpassed the ``target_search_threshold``.

Once the targets have been selected, they can be used in training a
model by passing the same config to the ``learn`` command:

.. code:: bash

    uncoverml learn config.yaml

Learning will run as normal, with the addition that the targets
selected by ``target_search`` will be included with the provided
training targets and used to train the model.

Cropping
--------

UncoverML has a built-in cropping function that is useful for a variety
of purposes. By providing a bounding box of ``xmin``, ``xmax``,
``ymin``, ``ymax``, all geotiffs provided to the uncoverml command will
be cropped to this extent before processing takes place. Original files
will be maintained; the cropped versions are stored temporarily and
removed by UncoverML on completion.

Config
~~~~~~

Cropping is performed by providing the ``extents`` block in the config:

.. code:: yaml

    extents:
      pixel_coordinates: False
      xmin: 120
      ymin: -20
      xmax: 121
      ymax: -21

- ``pixel_coordinates``: boolean indicating whether to treat the given
  crop coordinates as pixels (True) or as coordinates in the CRS used
  by the covariate geotiffs (False). If using pixel coordinates, these
  start at 0 and extend to the width and height of the image,
  inclusive.
- ``xmin``, ``ymin``, ``xmax``, ``ymax``: coordinates of the crop box.
  If a coordinate is not provided, or is out of bounds of the covariate
  images, the corresponding bound of the covariates is used.

Running
~~~~~~~

Cropping applies to the ``learn``, ``predict``, ``shiftmap`` and
``targetsearch`` commands of UncoverML. Running ``learn`` with the
above extents in the config will crop the provided covariates to that
extent, and the model will be trained on the remaining data. Similarly,
running prediction with the extents block will crop the covariates and
only predict on the remaining data.

This is very useful for speeding up jobs, or in situations where the
data is too large for memory. It's also convenient when providing
covariates of different sizes. UncoverML doesn't support differently
sized covariates directly, but by setting the extents to the smallest
intersecting area of the covariates, an intersecting stack of feature
data can be generated and used for training and prediction.

Using cropping to perform parallel predictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another use is for running parallel prediction jobs. For large-scale
jobs, particularly when using ensemble models and on HPC platforms, it
can be beneficial to break up prediction into multiple jobs and launch
them in parallel. To achieve this, first train the model on your full
dataset. Then create multiple prediction configs, each one predicting
on a chunk of the data:

.. code:: yaml

    extents:
      pixel_coordinates: True
      xmin: 0
      ymin: 0
      xmax: 1000
      ymax: 1000

    output:
      directory: path/to/output/directory

Repeat this, providing the extents for each chunk until the whole
dataset is covered.
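For instance, a hypothetical ``chunk2.yaml`` covering the adjacent
block of pixels (the extent values are illustrative) might contain:

.. code:: yaml

    extents:
      pixel_coordinates: True
      xmin: 1000
      ymin: 0
      xmax: 2000
      ymax: 1000

    output:
      directory: path/to/output/directory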
Once you have ``chunk1.yaml``, ``chunk2.yaml`` etc., submit them as
individual prediction jobs:

.. code:: bash

    uncoverml predict chunk1.yaml

Do so for each chunk config. Upon completion, you will have multiple
prediction maps prefixed with the config name, e.g.
``chunk1_prediction.tif``. Using a GIS viewer or other program, you can
then stitch the prediction chunks together.

Clustering
----------

UncoverML supports k-means clustering.

Config
~~~~~~

When providing a ``clustering`` block, the ``learning``, ``validation``
and ``targets`` blocks don't apply and are not required. ``features``
must be provided for training and prediction, and a ``prediction``
block for predicting with the k-means model.

.. code:: yaml

    clustering:
      file: /path/to/class_labels.shp
      property: name_of_class_label_property
      n_classes: 5
      oversample_factor: 5

- ``file``: path to a shapefile containing labels. This is optional. If
  provided, semi-supervised clustering is performed; otherwise
  clustering is unsupervised (see the sketch after this list).
- ``property``: the name of the class label property in the shapefile.
  Only required if a shapefile has been provided and semi-supervised
  clustering is being performed.
- ``n_classes``: the number of classes to cluster the data into.
- ``oversample_factor``: controls the number of samples drawn as part
  of the initialisation step. More processors used will increase the
  total number of samples drawn. Consider values of 1 for more than 16
  processors.
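For unsupervised clustering, simply omit the label shapefile:

.. code:: yaml

    clustering:
      # no `file`/`property`: unsupervised k-means
      n_classes: 5
      oversample_factor: 5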
Running
~~~~~~~

.. code:: bash

    uncoverml cluster config.yaml

will train and output the k-means model file.

.. code:: bash

    uncoverml predict config.yaml

will produce a map showing which class each pixel is most likely to
belong to.

Weighted Samples
----------------

Some models support a ``sample_weight`` parameter. Look at the
documentation of the ``fit`` method for the selected algorithm to see
if this is available. Sample weights can be provided by creating a
weight field in your target shapefile. Weights are integer values that
signify the relative importance of a sample. A weight of 0 means the
sample will be excluded entirely.

Config
~~~~~~

Weights are applied by providing the name of the weight field to the
``targets`` block:

.. code:: yaml

    targets:
      file: path/to/targets/shapefile.shp
      property: training_field
      weight_property: weight_field

Running
~~~~~~~

Run learning as normal:

.. code:: bash

    uncoverml learn config.yaml

There will be a log info message stating *'Sample weights are being
provided to fit method'* when the model is trained. If the model does
not support sample weights, then there will be a log info message
*'Model does not support sample weights, weights are being ignored'*
and unweighted training will continue.

Prediction Mask
---------------

A crop mask can be provided when performing predictions. This mask will
cause prediction to only predict and write values for pixels that have
a certain mask value. This is useful for masking out no data areas such
as the ocean, or for limiting predictions to an area of interest to
increase performance. The mask geotiff must be the same size as the
provided covariates. Cropping also applies to prediction masks.

Config
~~~~~~

To supply a mask, add the ``mask`` block to your config:

.. code:: yaml

    mask:
      file: path/to/mask.tif
      retain: 1

- ``file``: path to the mask geotiff.
- ``retain``: value in the mask denoting which pixels to predict.

Running
~~~~~~~

Run prediction as normal:

.. code:: bash

    uncoverml predict config.yaml

Adding fields to output table
-----------------------------

Any field in the target shapefile can be written to the
``rawcovariates.csv`` output table. This can be helpful for validation
and debugging, e.g. including the type or category of the sample in the
results.

Config
~~~~~~

To include fields in the output table, add the ``write_to_csv``
parameter to your ``targets`` block:

.. code:: yaml

    targets:
      file: path/to/targets/shapefile.shp
      property: training_field
      write_to_csv: [type, site_id]

- ``write_to_csv``: list of names of shapefile fields to include in the
  ``rawcovariates.csv`` output table.

Running
~~~~~~~

Run ``learn`` as normal and the fields will be included in the output
table:

.. code:: bash

    uncoverml learn config.yaml
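The ``targets`` options from the preceding sections can be combined in
a single block; a sketch with placeholder field names:

.. code:: yaml

    targets:
      file: path/to/targets/shapefile.shp
      property: training_field       # field used as the training value
      weight_property: weight_field  # integer sample weights
      write_to_csv: [type, site_id]  # extra fields copied to rawcovariates.csv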
.. _resampling-section:

Resampling
----------

UncoverML has a ``resample`` tool that allows a target shapefile to be
resampled based on value or spatial parameters.

Config
~~~~~~

Resampling requires a ``targets`` block with additional resampling
parameters, and an ``output`` block specifying a directory.

.. code:: yaml

    targets:
      file: path/to/targets/shapefile
      property: target_field_name
      resample:
        spatial:
          rows: 10
          cols: 10
          bootstrap: False
          output_samples: 1000
          fields_to_keep: [type, site_id]
        value:
          bins: 3
          bootstrap: True
          interval: linear
          fields_to_keep: [type, site_id]

    output:
      directory: ./resampling_out

- ``file``: path to the target shapefile.
- ``property``: name of the target observation field.
- ``resample``: the resampling arguments.

  - ``spatial``: arguments for spatial resampling.

    - ``rows``, ``cols``: int, the number of rows and columns to divide
      the shapefile area into. Each resulting cell is a bin.
    - ``bootstrap``: boolean, whether to sample with replacement. If
      true, then duplicate samples are allowed; otherwise each value
      will be sampled at most once.
    - ``output_samples``: int, the number of output samples. This is
      the total number, so the number of samples in each bin is equal
      to (output_samples // bins). If ``bootstrap`` is false and there
      are not enough samples in a bin to fulfill the requested number,
      then all samples will be taken and the bin will be short of
      samples.
    - ``fields_to_keep``: by default, the only value preserved in the
      resulting resampled shapefile is the property specified by
      ``property``. To keep additional fields, add them to this
      parameter.
  - ``value``: arguments for value resampling.

    - ``bins``: the number of bins to divide samples into.
    - ``interval``: ``linear`` or ``percentile``, the method for
      determining bin edges. If ``linear``, bin edges will be taken at
      equally spaced intervals from the min to the max of the target
      values. E.g., if bins is 3 and the data range is 0 to 300, then
      the bins will be 0 to 100, 100 to 200 and 200 to 300. If
      ``percentile``, the bin edges will be the Nth percentile values
      of the target data, with the percentiles being the result of
      linearly spacing the bin edges between 0 and 1. E.g., if bins is
      3, then the bins will cover the 0 to 33rd percentile, 33rd to
      66th percentile and 66th to 100th percentile.

Running
~~~~~~~

To perform resampling, run:

.. code:: bash

    uncoverml resample config.yaml

Output will be in the output directory, with the resampled shapefile in
a subdirectory named after the original shapefile with 'resampled'
appended.

Covariate Diagnostics
---------------------

UncoverML includes a ``covdiag`` command for convenience. This will
output some basic diagnostics for your covariates.

Running
~~~~~~~

Covdiag can be run on a single covariate:

.. code:: bash

    uncoverml covdiag path/to/tif

Or on a directory. Adding the ``-r`` flag will explore the directory
recursively, outputting diagnostics for all tiffs in all
subdirectories:

.. code:: bash

    uncoverml covdiag path/to/covariate/directory -r

By default, output is delivered to the console. To save it to a text
file, use output redirection:

.. code:: bash

    uncoverml covdiag /path/to/covariate/directory >> output.txt