The AMPL pipeline contains many parameters and options to fit models and make predictions. The parameters have been organized in the following sections:
Name of datastore bucket. Specific to LLNL datastore system.
Default:
gsk_ml
dataset_key
Description:
Datastore key (LLNL system) or file path for dataset.
dataset_name
Description:
Parameter for overriding the output files/dataset object names. Default is set within model_pipeline.
dataset_oid
Description:
OID of the model dataset inserted into the datastore. Specific to LLNL datastore system.
datastore
Description:
Boolean flag for using an input file from the LLNL specific datastore system based on a key of dataset_key
Default:
FALSE
Type:
Bool
id_col
Description:
Name of column containing compound IDs. Will default to "compound_id" if not specified
Default:
compound_id
min_compound_number
Description:
Minimum number of dataset compounds considered adequate for model training. A warning message will be issued if the dataset size is less than this.
Default:
200
Type:
int
response_cols
Description:
Name of column(s) containing response values. Will default to the last column if not specified. Can be input as a string of comma-separated values or as a comma-separated list (e.g. 'column1','column2'). Multitask models will be generated when multiple columns are specified.
save_results
Description:
Save model results to MongoDB. LLNL model_tracker system specific
Default:
FALSE
Type:
Bool
smiles_col
Description:
Name of column containing SMILES strings. Will default to "rdkit_smiles" if not specified
Default:
rdkit_smiles
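As a rough illustration of how the dataset parameters above might be collected into a JSON configuration, the sketch below builds a dictionary and serializes it. The file path and the response column name are hypothetical placeholders, and passing values as native JSON types rather than strings is an assumption made here for brevity.

```python
import json

# Parameter names and defaults are taken from the entries above.
# The path and the response column name are hypothetical placeholders.
config = {
    "dataset_key": "/path/to/dataset.csv",  # hypothetical file path
    "datastore": False,                     # read from a file, not the LLNL datastore
    "id_col": "compound_id",                # documented default
    "smiles_col": "rdkit_smiles",           # documented default
    "response_cols": "pIC50",               # hypothetical column name
    "min_compound_number": 200,             # documented default
}

# Serialize to JSON text, e.g. for writing to a configuration file.
config_text = json.dumps(config, indent=4)
print(config_text)
```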
Model Building Parameters
Autoencoders
autoencoder_bucket
Description:
datastore bucket for the autoencoder file. Specific to LLNL datastore system. TODO: Not yet implemented
Default:
gsk_ml
autoencoder_key
Description:
Base of key for the autoencoder. TODO: Not yet implemented
autoencoder_type
Description:
Type of autoencoder being used as features. TODO: not yet implemented
Default:
molvae
mol_vae_model_file
Description:
Trained model HDF5 file path, only needed for MolVAE featurizer
Classifiers
class_name
Description:
User specified list of names of each class
class_number
Description:
User specified number of classes. This is required for NN models but inferred for RF and XGBoost models.
Default:
2
Type:
int
Descriptors
descriptor_bucket
Description:
datastore bucket for the descriptor file. Specific to LLNL datastore system.
Default:
gskdata
descriptor_key
Description:
Base of key for descriptor table file. Subset files will be prepended with "subset" and appended with the dataset name. Specific to LLNL datastore system.
descriptor_oid
Description:
dataset_oid for the descriptor file in the datastore. Specific to LLNL datastore system.
descriptor_spec_bucket
Description:
Bucket where descriptor specification is located for a descriptor type. Specific to LLNL datastore system.
Default:
public
descriptor_spec_key
Description:
Datastore key or file path for a table specifying descriptor columns for each descriptor type. Specific to LLNL datastore system.
Default:
descriptor_sets_sources_by_descr_type.csv
descriptor_type
Description:
Type of descriptors being used as features, e.g. moe, dragon7, used when featurizer = "computed_descriptors". Sets the subclass within featurizer.py
Default:
moe
Options:
'moe', 'mordred_filtered', and 'rdkit_raw' are recommended. See atomsci/ddm/data/descriptor_sets_sources_by_descr_type.csv for more.
ECFP
ecfp_radius
Description:
Radius used for ECFP generation
Default:
2
Type:
int
ecfp_size
Description:
Size of ECFP bit vectors
Default:
1024
Type:
int
General
featurizer
Description:
Type of featurizer to use on chemical structures. Currently supported options: ["ecfp","graphconv","molvae","computed_descriptors","descriptors"]. Further information on descriptors is in descriptor_type. Options are used to set the featurization subclass in the create_featurization method of featurization.py. Can be input as a comma separated list for hyperparameter search (e.g. 'ecfp','molvae')
Type:
str
model_choice_score_type
Description:
Type of score function used to choose best epoch and/or hyperparameters (defaults to "roc_auc" for classification and "r2" for regression).
model_type
Description:
Type of model to fit (NN, RF, or xgboost). The model_type sets the model subclass in model_wrapper. Can be input as a comma separated list for hyperparameter search (e.g. 'NN','RF','xgboost')
Type:
str
prediction_type
Description:
Sets the prediction type of the model to a choice between ["regression","classification"]. Used as a flag for model behavior throughout the pipeline.
Default:
regression
Type:
choice
previously_featurized
Description:
Boolean flag for loading in previously featurized data files. If set to True, the method get_featurized_data within model_datasets will attempt to load the featurized dataset associated with the given dataset_oid parameter
Default:
TRUE
Type:
Bool
uncertainty
Description:
Boolean flag for computing uncertainty estimates for regression model predictions. Will also change the default values for dropouts if set to True.
Default:
TRUE
Type:
Bool
verbose
Description:
True/False flag for setting verbosity
Default:
FALSE
Type:
Bool
production
Description:
True/False flag for training models in production mode. The entire dataset is used in training, validation, and test. If using training epochs, the model will train for max_epochs regardless of validation error.
Default:
FALSE
Type:
Bool
Graph Convolution
optimizer_type
Description:
Optimizer specific for graph conv, defaults to "adam"
Default:
adam
Mordred
mordred_cpus
Description:
Max number of CPUs to use for Mordred descriptor computations. None means use all available
Type:
int
Neural Networks
baseline_epoch
Description:
Baseline epoch at which to evaluate performance for DNN models
Default:
30
Type:
int
batch_size
Description:
Sets the model batch size within model_wrapper
Default:
50
Type:
int
bias_init_consts
Description:
Comma-separated list of initial bias parameters per layer for dense NN models with conditional values. Defaults to [1.0]*len(layer_sizes). Must be same length as layer_sizes. Can be input as a space-separated list of comma-separated lists for hyperparameters. Hyperparameter example: '1.0,1.0 0.9,0.9 0.8,0.9' Default behavior is set within __init__ method of DCNNModelWrapper. Defaults: all:[1.0,1.0]
dropouts
Description:
Comma-separated list of dropout rates per layer for NN models with default values conditional on featurizer. Default behavior is controlled in model_wrapper.py. Must be same length as layer_sizes. Can be input as a space-separated list of comma-separated lists for hyperparameters (e.g. '0.4,0.4 0.2,0.2 0.3,0.3'). Default behavior is set within __init__ method of DCNNModelWrapper. Defaults: graphconv: [0,0,0], non-graphconv:[0.40,0.40]
Type:
list
layer_sizes
Description:
Comma-separated list of layer sizes for NN models with default values conditional on featurizer. Must be same length as dropouts. Can be input as a space-separated list of comma-separated lists for hyperparameters (e.g. '64,16 200,100 1000,500'). Default behavior is set within __init__ method of DCNNModelWrapper. Defaults: graphconv: [64,64,128], ecfp: [1000,500], descriptors: [200,100]
Type:
list
learning_rate
Description:
Learning rate for dense NN models. Input as comma separated floats for hyperparameters (e.g. '0.0005,0.0004,0.0003')
Default:
0.0005
max_epochs
Description:
Maximum number of training epochs to run for DNN models. Default 30.
Default:
30
Type:
int
weight_decay_penalty
Description:
weight_decay_penalty: float. The magnitude of the weight decay penalty to use. Can be input as a comma separated list of strings for hyperparameter search (e.g. '0.0001,0.0002,0.0003') default 0.0001
Default:
0.0001
weight_decay_penalty_type
Description:
weight_decay_penalty_type: str. The type of penalty to use for weight decay, either "l1" or "l2". Can be input as a comma separated list for hyperparameter search (e.g. 'l1,l2') default: "l2"
Default:
l2
Type:
str
weight_init_stddevs
Description:
Comma-separated list of standard deviations per layer for initializing weights in dense NN models with conditional values. Must be same length as layer_sizes. Can be input as a space-separated list of comma-separated lists for hyperparameters (e.g. '0.001,0.001 0.002,0.002 0.03,0.03'). Default behavior is set within __init__ method of DCNNModelWrapper. Defaults: all: [0.02,0.02]
Default:
[0.02]*len(param.layer_size)
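Several of the neural network parameters above (dropouts, layer_sizes, bias_init_consts, weight_init_stddevs) accept a space-separated list of comma-separated lists for hyperparameter search. The sketch below shows one way such a string could be read; parse_nested_list is a hypothetical helper written for illustration, not the pipeline's own parser.

```python
def parse_nested_list(text, cast=float):
    """Split 'a,b c,d' into [[a, b], [c, d]], casting each element.

    Hypothetical helper: spaces separate candidate settings,
    commas separate the per-layer values within one setting.
    """
    return [[cast(item) for item in group.split(",")] for group in text.split()]


dropout_candidates = parse_nested_list("0.4,0.4 0.2,0.2 0.3,0.3")
layer_size_candidates = parse_nested_list("64,16 200,100 1000,500", cast=int)

# Each dropout candidate should have one rate per layer in the paired layer_sizes.
for rates, sizes in zip(dropout_candidates, layer_size_candidates):
    assert len(rates) == len(sizes)
```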
Random Forests
rf_estimators
Description:
Number of estimators to use in random forest models. Hyperparameter searching requires 3 inputs: start, end, step when used with search_type geometric or grid (example: '100,500,100') or can be input as a list of possible values for search_type user_specified (example: '100,200,300,400,500')
Default:
500
rf_max_depth
Description:
The maximum depth of a decision tree in the random forest. Hyperparameter searching requires 3 inputs: start, end, step when used with search_type geometric or grid (example: '4,7,1') or can be input as a list of possible values for search_type user_specified (example: '4,5,6,7')
rf_max_features
Description:
Max number of features to split random forest nodes. Hyperparameter searching requires 3 inputs: start, end, step when used with search_type geometric or grid (example: '16,32,4') or can be input as a list of possible values for search_type user_specified (example: '16,20,24,28,32')
Default:
32
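The 'start,end,step' form used by the random forest parameters above can be pictured as follows for a grid search. This is only a sketch: expand_grid is a hypothetical helper, and treating the end value as inclusive is an assumption based on the parallel examples ('100,500,100' and '100,200,300,400,500'). The geometric search type is not covered here.

```python
def expand_grid(spec):
    """Expand a 'start,end,step' string into a list of integer values.

    Hypothetical helper; the end value is assumed to be inclusive.
    """
    start, end, step = (int(part) for part in spec.split(","))
    return list(range(start, end + 1, step))


rf_estimator_values = expand_grid("100,500,100")
rf_max_depth_values = expand_grid("4,7,1")
```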
Hybrid model
is_ki
Description:
True/False flag noting whether the dose-response activity is Ki (True) or XC50 (False). If True, ki_convert_ratio is also needed to convert Ki into IC50 and into single-concentration activity.
Default:
False
ki_convert_ratio
Description:
To convert Ki into IC50, a ratio is needed. For enzymatic inhibition assays it can be [S]/Km, where [S] is the concentration of substrate and Km is the Michaelis constant. For radioligand competitive binding it can be [S]/Kd, where [S] is the concentration of the radioligand and Kd is its dissociation constant. [S] and Kd/Km should have the same unit so that the ratio is unitless.
Default:
None
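The ratio described above matches the form of the Cheng-Prusoff relationship for competitive inhibition, IC50 = Ki * (1 + [S]/Km). The sketch below assumes that relationship; the source does not state which formula the hybrid model applies, and ki_to_ic50 is a hypothetical helper.

```python
def ki_to_ic50(ki, ki_convert_ratio):
    """Convert Ki to IC50, assuming the Cheng-Prusoff relationship.

    ki_convert_ratio is the unitless [S]/Km (or [S]/Kd) ratio.
    The result has the same concentration unit as ki.
    """
    return ki * (1.0 + ki_convert_ratio)


# Example: Ki of 10 nM with substrate concentration equal to Km (ratio 1.0).
ic50_nm = ki_to_ic50(10.0, 1.0)
```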
loss_func
Description:
The loss function used in hybrid model training. Currently supports poisson and l2.
Default:
poisson
Splitting
base_splitter
Description:
Type of splitter to use for train/validation split if temporal split used for test set. May be random, scaffold, or ave_min. The allowable choices are set in splitter.py
Default:
scaffold
Type:
str
butina_cutoff
Description:
cutoff Tanimoto similarity for clustering in Butina splitter. TODO: will be implemented when DeepChem updates their butina splitter. TODO rename to butina_cutoff in v2
Default:
0.18
Type:
float
cutoff_date
Description:
Cutoff date for test set compounds in temporal splitter TODO: Needs some formatting guidelines
Type:
str
date_col
Description:
Column in dataset containing dates for temporal splitter
Type:
str
num_folds
Description:
Number of k-folds to use in k-fold cross validation
Default:
5
Type:
int
previously_split
Description:
Boolean flag for loading in previously split train, validation, and test csv files.
Default:
FALSE
Type:
bool
split_strategy
Description:
Choice of splitting type between "k_fold_cv" for k fold cross validation and "train_valid_test" for a normal train/valid/test split. If split_test_frac or split_valid_frac are not set, "train_valid_test" sets are split according to the model type default
Default:
train_valid_test
Type:
Choice
split_test_frac
Description:
Fraction of data to put in held-out test set for train_valid_test split strategy. TODO: Behavior of split_test_frac is dependent on the DeepChem model_wrapper.
Default:
0.1
Type:
float
split_uuid
Description:
UUID for csv file containing train, validation, and test split information
split_valid_frac
Description:
Fraction of data to put in validation set for train_valid_test split strategy. TODO: Behavior of split_valid_frac is dependent on the DeepChem model_wrapper.
Default:
0.1
Type:
float
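To make the effect of split_valid_frac and split_test_frac concrete, the sketch below computes approximate subset sizes for a train_valid_test split. It is plain arithmetic under the assumption that the training set receives whatever remains; the actual splitter classes may round or assign compounds differently (for example, scaffold-based splitters keep whole scaffolds together).

```python
def split_sizes(n_compounds, split_valid_frac=0.1, split_test_frac=0.1):
    """Return approximate (train, valid, test) counts for the given fractions.

    Hypothetical helper: training gets the remainder after validation and test.
    """
    n_valid = round(n_compounds * split_valid_frac)
    n_test = round(n_compounds * split_test_frac)
    n_train = n_compounds - n_valid - n_test
    return n_train, n_valid, n_test


sizes = split_sizes(1000)  # uses the documented defaults of 0.1 and 0.1
```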
splitter
Description:
Type of splitter to use: index, random, scaffold, butina, ave_min, temporal, fingerprint, multitaskscaffold, or stratified. Used to set the splitting.py subclass. Can be input as a comma separated list for hyperparameter search (e.g. 'scaffold','random')
Default:
scaffold
Type:
str
mtss_num_super_scaffolds
Description:
This specifies the number of genes in a chromosome for the genetic algorithm. Scaffold bins are often very small and contain only 1 compound. Scaffolds are therefore combined into super scaffolds to reduce the number of genes, which also reduces complexity and runtime.
Default:
40
Type:
int
mtss_num_generations
Description:
The number of generations the genetic algorithm will run.
Default:
20
Type:
int
mtss_num_pop
Description:
Size of population per generation in the genetic algorithm.
Default:
100
Type:
int
mtss_train_test_dist_weight
Description:
How much weight to give the Tanimoto distance between training and test partitions.
Default:
1.0
Type:
float
mtss_train_valid_dist_weight
Description:
How much weight to give the Tanimoto distance between training and valid partitions.
Default:
1.0
Type:
float
mtss_response_distr_weight
Description:
How much weight to give to matching the response value distributions between split subsets.
Default:
1.0
Type:
float
mtss_split_fraction_weight
Description:
How much weight to give adherence to requested subset fractions.
Default:
1.0
Type:
float
Transformers
feature_transform_type
Description:
type of transformation for the features
Default:
normalization
Type:
Choice
response_transform_type
Description:
type of transformation for the response column (defaults to "normalization") TODO: Not currently implemented
Default:
normalization
transformer_bucket
Description:
Datastore bucket where the transformer is stored. Specific to LLNL datastore system.
Default:
gsk_ml
transformer_key
Description:
Path to a saved transformer (stored as tuple, e.g. (transform_features, transform_response))
Type:
str
transformer_oid
Description:
Dataset oid of the transformer saved in the datastore. Specific to LLNL datastore system.
transformers
Description:
Boolean switch for using transformation on regression output. Default is True
Default:
TRUE
Type:
Bool
UMAP
umap_dim
Description:
Dimension of projected feature space, if UMAP transformation is requested. Can be input as a comma separated list for hyperparameter search (e.g. '2,6,10').
Default:
10
umap_metric
Description:
Distance metric used, if UMAP transformation is requested. Can be input as a comma separated list for hyperparameter search (e.g. 'euclidean','cityblock')
Default:
euclidean
umap_min_dist
Description:
Minimum distance used in UMAP projection, if UMAP transformation is requested. Can be input as a comma separated list for hyperparameter search (e.g. '0.01,0.02,0.05')
Default:
0.05
umap_neighbors
Description:
Number of nearest neighbors used in UMAP projection, if UMAP transformation is requested. Can be input as a comma separated list for hyperparameter search (e.g. '10,20,30')
Default:
20
umap_targ_wt
Description:
Weight given to training set response values in UMAP projection, if UMAP transformation is requested. Can be input as a comma separated list for hyperparameter search (e.g. '0.0,0.1,0.2')
Default:
0.0
XGBoost
xgb_colsample_bytree
Description:
Subsample ratio of columns when constructing each tree. Can be input as a comma separated list for hyperparameter search (e.g. '0.8,0.9,1.0')
Default:
1.0
xgb_gamma
Description:
Minimum loss reduction required to make a further partition on a leaf node of the tree. Can be input as a comma separated list for hyperparameter search (e.g. '0.0,0.1,0.2')
Default:
0.0
xgb_learning_rate
Description:
Boosting learning rate (xgboost's "eta"). Can be input as a comma separated list for hyperparameter search (e.g. '0.1,0.01,0.001')
Default:
0.1
xgb_max_depth
Description:
Maximum tree depth for base learners. Can be input as a comma separated list for hyperparameter search (e.g. '4,5,6')
Default:
6
xgb_min_child_weight
Description:
Minimum sum of instance weight(hessian) needed in a child. Can be input as a comma separated list for hyperparameter search (e.g. '1.0,1.1,1.2')
Default:
1.0
xgb_n_estimators
Description:
Number of estimators to use in xgboost models. Can be input as a comma separated list for hyperparameter search (e.g. '100,200,300')
Default:
100
xgb_subsample
Description:
Subsample ratio of the training instance. Can be input as a comma separated list for hyperparameter search (e.g. '0.8,0.9,1.0')
Default:
1.0
Additional DeepChem Models and Featurizers
As of version 1.3 AMPL partially supports several DeepChem models. It is possible to train and predict
using these models, but they are not currently integrated with the hyperparameter search wrapper.
These models can be selected by using the model_type parameter, e.g. "model_type":"AttentiveFPModel".
Parameters for each model can be passed in by prefixing the parameter with the name of the model.
Each DeepChem model expects a specific featurizer. Model/Featurizer compatibility is listed in this table.
Featurizers can be specified by setting the featurizer parameter. Featurizer parameters are passed
in the same way as model parameters.
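The prefixing convention described above might look like the following sketch. The underscore separator and the num_layers parameter name are assumptions made for illustration; check the model's own documentation for the parameter names it accepts. prefix_params is a hypothetical helper.

```python
def prefix_params(model_name, params, sep="_"):
    """Prefix each parameter name with the model name.

    Hypothetical helper; the separator is an assumption.
    """
    return {f"{model_name}{sep}{name}": value for name, value in params.items()}


config = {"model_type": "AttentiveFPModel"}
# 'num_layers' is a hypothetical parameter name used only for illustration.
config.update(prefix_params("AttentiveFPModel", {"num_layers": 3}))
```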
MongoDB collection to save model results in. Specific to LLNL model tracker system.
Default:
model_tracker
data_owner
Description:
Option for setting group permissions for created files. Options: ['username', 'data_owner_group', 'gsk', 'public']. Specific to LLNL model tracker system.
Default:
gsk
data_owner_group
Description:
When data_owner is set to data_owner_group, this is the option for custom group name of created files. Specific to LLNL model tracker system.
Default:
gskcraa
model_bucket
Description:
Bucket in the datastore for the model. Specific to LLNL model tracker system.
Default:
gsk_ml
Type:
str
model_dataset_oid
Description:
OID of the model dataset inserted into the datastore. Specific to LLNL model tracker system
model_filter
Description:
Path to the model filter configuration file. Is loaded and stored as a dictionary. Specific to LLNL model tracker system.
model_uuid
Description:
UUID generated after model creation (pythonic_ID). Specific to LLNL model tracker system.
Type:
str
output_dir
Description:
File location where the model output will be saved. Defaults to <result_dir>/ TODO: this parameter is redundant with result_dir
result_dir
Description:
Parent of directory where result files will be written, defaults to '/usr/local/data'
Default:
/usr/local/data/
Model Metadata
system
Description:
Computational system you are running on, LC or twintron-blue. LLNL system specific
Default:
twintron-blue
Type:
str
Miscellaneous
config_file
Description:
Full path to the optional configuration file. The configuration file is a set of parameters in .json file format. TODO: Does not send a warning if set concurrently with other parameters.
num_model_tasks
Description:
DEPRECATED AND IGNORED. This argument is now inferred from response_cols. Number of tasks to run for. 1 means a singletask model, > 1 means a multitask model
Default:
1
Type:
int
Hyperparameter Optimization
dropout_list
Description:
Comma-separated list of dropout rates for permutation of NN layers (e.g. '0.0,0.4,0.6'). Used within permutate_NNlayer_combo_params to return combinations from layer_nums, node_nums, dropout_list and max_final_layer_size. dropout_list is used to set the allowable permutations of dropouts. For hyperparameters only.
hyperparam
Description:
Boolean flag to indicate whether we are running the hyperparameter search script
Default:
FALSE
hyperparam_uuid
Description:
UUID of hyperparam search run model was generated in. Not applicable for single-run jobs. Specific to LLNL model tracker system.
layer_nums
Description:
Comma-separated list of number of layers for permutation of NN layers. (e.g. '2,3,4'). Used within permutate_NNlayer_combo_params to return combinations from layer_nums, node_nums, dropout_list and max_final_layer_size. layer_nums is used to set the allowable lengths of layer_sizes. For hyperparameters only.
lc_account
Description:
SLURM account to charge hyperparameter batch runs to. This will be replaced by the slurm_account option. If lc_account and slurm_account are both set, slurm_account will be used. If set to None then this parameter will not be used.
Default:
baasic
max_final_layer_size
Description:
The max number of nodes in the last layer within layer_sizes and dropouts in hyperparameter search; max_final_layer_size = min(node_nums) if min(node_nums) > max_final_layer_size. (e.g. '16,32'). Used within permutate_NNlayer_combo_params to return combinations from layer_nums, node_nums, dropout_list and max_final_layer_size.
Default:
32
node_nums
Description:
Comma-separated list of number of nodes per layer for permutation of NN layers. (e.g. '4,8,16'). Used within permutate_NNlayer_combo_params to return combinations from layer_nums, node_nums, dropout_list and max_final_layer_size. node_nums is used to set the node values within layer_sizes. For hyperparameters only.
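The interaction of layer_nums, node_nums, dropout_list and max_final_layer_size can be pictured with the sketch below. It is an illustration only, not the implementation of permutate_NNlayer_combo_params: the function name layer_combos is hypothetical, and applying one dropout rate uniformly to all layers is an assumption.

```python
from itertools import product


def layer_combos(layer_nums, node_nums, dropout_list, max_final_layer_size):
    """Enumerate (layer_sizes, dropouts) pairs for a hyperparameter search.

    Illustrative sketch: every ordering of node counts is tried for each
    allowed depth, and combinations whose last layer is larger than
    max_final_layer_size are skipped.
    """
    combos = []
    for n_layers in layer_nums:
        for sizes in product(node_nums, repeat=n_layers):
            if sizes[-1] > max_final_layer_size:
                continue  # final layer too wide
            for rate in dropout_list:
                # Assumption: the same dropout rate is used for every layer.
                combos.append((list(sizes), [rate] * n_layers))
    return combos


combos = layer_combos(layer_nums=[2], node_nums=[4, 8],
                      dropout_list=[0.0], max_final_layer_size=4)
```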
max_jobs
Description:
Max number of jobs to be in the queue at one time for an LC machine. Specific to LLNL system.
Default:
80
Type:
int
nn_size_scale_factor
Description:
Scaling factor for constraining network size based on number of parameters in the network for hyperparam search
Default:
1
Type:
float
python_path
Description:
Path to desired python version
Default:
This defaults to the Python installation used to parse the JSON file. This is done by using sys.executable
rerun
Description:
After parameter combos have been generated, rerun=False will check the model tracker to see if a model with a particular param combination has already been built. If it’s been built, do not create a new model or submit a slurm job. If rerun=True, the check will be skipped completely and a slurm job will be submitted regardless of whether a model has previously been built with these parameters. Specific to hyperparameter search.
Default:
TRUE
Type:
Bool
script_dir
Description:
Path where pipeline file you want to run hyperparam search from is located
Default:
.
search_type
Description:
Type of hyperparameter search to do. Options = [grid, random, geometric, hyperopt and user_specified]
Default:
grid
shortlist_key
Description:
CSV file of assays of interest. Specific to LLNL model tracker system.
slurm_account
Description:
SLURM account to charge hyperparameter batch runs to. This will replace the lc_account option. If lc_account and slurm_account are both set, slurm_account will be used. If set to None then this parameter will not be used.
Default:
None
slurm_export
Description:
SLURM environment variables propagated for hyperparameter search batch jobs. If set to None then this parameter will not be used.
Default:
ALL
slurm_nodes
Description:
Number of nodes for hyperparameter search batch jobs. If set to None then this parameter will not be used.
Default:
1
Type:
int
slurm_options
Description:
Additional SLURM options for hyperparameter search batch jobs. Example: '--option1=value1 --option2=value2'. If set to None then this parameter will not be used.
Default:
None
slurm_partition
Description:
SLURM partition to run hyperparameter batch runs on. If set to None then this parameter will not be used.
Default:
pbatch
slurm_time_limit
Description:
Time limit in minutes for hyperparameter search batch jobs.
Default:
1440
Type:
int
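The rule that a SLURM parameter set to None is not used can be sketched as below. The sbatch flags shown (--account, --partition, --nodes, --time, --export) are standard SLURM options, but how the pipeline actually assembles its submission command is not described here, so sbatch_args is a hypothetical helper.

```python
def sbatch_args(params):
    """Build an sbatch argument list, skipping parameters set to None.

    Hypothetical helper mapping the slurm_* parameters to sbatch flags.
    """
    flag_for = {
        "slurm_account": "--account",
        "slurm_partition": "--partition",
        "slurm_nodes": "--nodes",
        "slurm_time_limit": "--time",   # sbatch accepts a plain number of minutes
        "slurm_export": "--export",
    }
    args = ["sbatch"]
    for name, flag in flag_for.items():
        value = params.get(name)
        if value is not None:
            args.append(f"{flag}={value}")
    # Free-form extra options, e.g. '--option1=value1 --option2=value2'.
    if params.get("slurm_options"):
        args.extend(params["slurm_options"].split())
    return args


# Documented defaults for the slurm_* parameters.
defaults = {
    "slurm_account": None,
    "slurm_partition": "pbatch",
    "slurm_nodes": 1,
    "slurm_time_limit": 1440,
    "slurm_export": "ALL",
    "slurm_options": None,
}
args = sbatch_args(defaults)
```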
split_only
Description:
Boolean flag used with model_pipeline.py to indicate splitting of the datasets when running the hyperparameter search
Default:
FALSE
Type:
bool
use_shortlist
Description:
Use a list of assays. Specific to LLNL model tracker system.
Default:
FALSE
Type:
Bool
Bayesian Optimization
Search Domain Specifications
The following parameters are used to specify the search domains for certain model parameters in a Bayesian hyperparameter optimization. Each search domain parameter is
tied to a specific model parameter. Only a subset of model parameters may be optimized in this way, but more will be supported in future releases. See the hyperopt package documentation at https://github.com/hyperopt/hyperopt/wiki/FMin#2-defining-a-search-space to learn more about the search domain format.
Search domain for NN model dropouts parameter in Bayesian Optimization. The format is scheme|num_layers|parameters, e.g. uniform|3|0,0.4. Note that the number of layers (the number between the two | characters) cannot be changed during optimization; to try a different number of layers, run several optimizations.
Default:
None
ls
Description:
Search domain for NN model layer_sizes parameter in Bayesian Optimization. The format is scheme|num_layers|parameters, e.g. uniformint|3|8,512. Note that the number of layers (the number between the two | characters) cannot be changed during optimization; to try a different number of layers, run several optimizations.
Default:
None
rfe
Description:
Search domain for RF model rf_estimators parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniformint|8,512.
Default:
None
rfd
Description:
Search domain for RF model rf_max_depth parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniformint|8,512.
Default:
None
rff
Description:
Search domain for RF model rf_max_features parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniformint|8,200.
Default:
None
xgbg
Description:
Search domain for XGBoost model xgb_gamma parameter in Bayesian Optimization. The format is scheme|parameters, e.g. loguniform|-9.2,-4.6.
Default:
None
xgbl
Description:
Search domain for XGBoost model xgb_learning_rate parameter in Bayesian Optimization. The format is scheme|parameters, e.g. loguniform|-4.6,-2.3.
Default:
None
xgbd
Description:
Search domain for XGBoost model xgb_max_depth parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniformint|3,10.
Default:
None
xgbc
Description:
Search domain for XGBoost model xgb_colsample_bytree parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniform|0.1,1.0.
Default:
None
xgbs
Description:
Search domain for XGBoost model xgb_subsample parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniform|0.1,1.0.
Default:
None
xgbn
Description:
Search domain for XGBoost model xgb_n_estimators parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniformint|200,1000.
Default:
None
xgbw
Description:
Search domain for XGBoost model xgb_min_child_weight parameter in Bayesian Optimization. The format is scheme|parameters, e.g. uniform|0.5,2.0.
Default:
None
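The two search domain formats used above, scheme|parameters and scheme|num_layers|parameters, can be split apart as in the sketch below. parse_domain is a hypothetical helper for illustration and does not build hyperopt objects. Note that for hyperopt's loguniform scheme the bounds are given in log space, so -9.2 and -4.6 correspond to roughly 1e-4 and 1e-2.

```python
import math


def parse_domain(spec):
    """Split a search domain string into (scheme, num_layers, bounds).

    Hypothetical helper: num_layers is None for the two-field format.
    """
    parts = spec.split("|")
    scheme = parts[0]
    num_layers = int(parts[1]) if len(parts) == 3 else None
    bounds = [float(value) for value in parts[-1].split(",")]
    return scheme, num_layers, bounds


layer_domain = parse_domain("uniformint|3|8,512")
gamma_domain = parse_domain("loguniform|-9.2,-4.6")

# loguniform bounds are exponents: convert them to see the actual value range.
gamma_low, gamma_high = (math.exp(bound) for bound in gamma_domain[2])
```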
Checkpointing parameters
hp_checkpoint_save
Description:
Binary file to save a checkpoint of the HPO trial project, which can be used to continue the HPO search later.
Default:
None
hp_checkpoint_load
Description:
binary file to load a checkpoint of a previous HPO trial project, to continue the HPO search.