configuration configuration run

Jian Zhang (James) edited this page May 17, 2023 · 4 revisions

Training and Inference

GraphStorm provides dozens of configurable parameters for users to control their training and inference tasks. This document describes each of them in detail. You can define these parameters in a YAML config file, on the command line, or both. GraphStorm parses the YAML config file first, then parses the command line arguments, which overwrite parameters defined in the YAML file or add new ones.
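The parsing order described above can be illustrated with a minimal sketch. It is not GraphStorm's own parser: the function name merge_config is hypothetical, only three of the documented options are registered, and a plain dict stands in for the parsed YAML file.

```python
import argparse

def merge_config(yaml_params, cli_args):
    """Overlay command line arguments on parameters already read from YAML."""
    parser = argparse.ArgumentParser()
    # SUPPRESS keeps options that were not given out of the namespace, so only
    # options actually present on the command line override the YAML values.
    parser.add_argument("--num-epochs", type=int, default=argparse.SUPPRESS)
    parser.add_argument("--lr", type=float, default=argparse.SUPPRESS)
    parser.add_argument("--hidden-size", type=int, default=argparse.SUPPRESS)
    overrides = vars(parser.parse_args(cli_args))
    merged = dict(yaml_params)   # YAML parameters come first
    merged.update(overrides)     # arguments overwrite them or add new ones
    return merged

# A dict standing in for the content of a YAML config file (assumption).
yaml_params = {"num_epochs": 5, "lr": 0.5}
merged = merge_config(yaml_params, ["--lr", "0.01", "--hidden-size", "128"])
print(merged)
```

Here lr is overwritten by the command line, hidden_size is added, and num_epochs keeps its YAML value.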

Launch Arguments

GraphStorm’s graphstorm.run.launch command has a set of parameters to control the launch behavior of training and inference.

  • workspace: The folder where the launch command assumes all artifacts are saved. If the file paths in other parameters are relative, the launch command looks for those files in the workspace.

  • part-config: (Required) Path to a file containing the graph partition configuration. The graph partition is generated by the GraphStorm partition tools. HINT: Use an absolute path to avoid path-related problems. Otherwise, the file must be in the workspace.

  • ip-config: (Required) Path to a file containing the IPs of the instances in a distributed training/inference cluster. In the IP config file, each line stores one IP. HINT: Use an absolute path to avoid path-related problems. Otherwise, the file must be in the workspace.

  • num-trainers: The number of trainer processes per machine. Must be greater than 0.

  • num-servers: The number of server processes per machine. Must be greater than 0.

  • num-samplers: The number of sampler processes per trainer process. Must be 0 or greater.

  • num-server-threads: The number of OMP threads in the server process. Must be greater than 0. Keep it small if server processes and trainer processes run on the same machine. By default, it is 1.

  • ssh-port: SSH port used by the host node to communicate with the other nodes in the cluster.

  • ssh-username: Optional. When issuing commands (via ssh) to cluster, use the provided username in the ssh command.

  • graph-format: The format of the graph structure of each partition. The allowed formats are csr, csc and coo. A user can specify multiple formats, separated by “,”. For example, the graph format is “csr,csc”.

  • extra-envs: Extra environment variables to set. For example, you can set LD_LIBRARY_PATH and NCCL_DEBUG by adding:

    • --extra-envs LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH NCCL_DEBUG=INFO

  • lm-encoder-only: Indicates that the model uses a language model plus a decoder only. No GNN is involved; only the graph structure is used.
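As an illustration of the constraints listed above, the following sketch checks a set of launch arguments and flattens them into an argument list. The helper build_launch_args and all file paths are hypothetical, and passing several KEY=VALUE items after one extra-envs flag is an assumption. The training or inference script that follows the launch arguments is left out.

```python
def build_launch_args(params):
    """Validate launch parameters and flatten them into an argument list."""
    for required in ("part-config", "ip-config"):
        if required not in params:
            raise ValueError(f"--{required} is required")
    # Lower bounds taken from the parameter descriptions above.
    bounds = (("num-trainers", 1), ("num-servers", 1),
              ("num-samplers", 0), ("num-server-threads", 1))
    for name, minimum in bounds:
        if name in params and params[name] < minimum:
            raise ValueError(f"--{name} must be >= {minimum}")
    args = []
    for name, value in params.items():
        args.append(f"--{name}")
        if isinstance(value, (list, tuple)):
            # Assumption: a flag such as extra-envs takes several values.
            args.extend(str(item) for item in value)
        else:
            args.append(str(value))
    return args

launch_args = build_launch_args({
    "workspace": "/data/workspace",                # hypothetical paths
    "part-config": "/data/workspace/graph.json",
    "ip-config": "/data/workspace/ip_list.txt",
    "num-trainers": 4,
    "num-servers": 1,
    "num-samplers": 0,
    "graph-format": "csr,csc",
    "extra-envs": ["NCCL_DEBUG=INFO"],
})
print(" ".join(launch_args))
```

Using absolute paths for part-config and ip-config, as the hints above suggest, keeps the result independent of the workspace setting.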

Note

The configurations below can be set either in a YAML configuration file or as arguments of the launch command.

Environment Configurations

  • backend: (Required) PyTorch distributed backend. Supported backends are gloo and nccl; the suggested backend is gloo.
    • Yaml: backend: gloo

    • Argument: --backend gloo

    • Default value: gloo

  • verbose: Set to true to print more execution information.
    • Yaml: verbose: false

    • Argument: --verbose false

    • Default value: false

Model Configurations

GraphStorm provides a set of parameters to configure the GNN model structure (input layer, GNN layers, decoder layer, etc.).

  • model_encoder_type: (Required) Graph encoder model used to encode graph data. It can be rgat or rgcn.
    • Yaml: model_encoder_type: rgcn

    • Argument: --model-encoder-type rgcn

    • Default value: This parameter must be provided by user.

  • node_feat_name: User-defined node feature name. It accepts two formats: a) fname: if a node has node features, the corresponding feature name is fname; b) ntype0:feat0 ntype1:featA …: different node types have different node feature name(s). In the example, “ntype0” has a node feature named “feat0” and “ntype1” has a node feature named “featA”. Note: the colon (:) and space characters are not allowed in node feature names. In YAML format, put each node type’s feature on a separate line that starts with a hyphen.
    • Yaml: node_feat_name:
      - "ntype1:featA"
      - "ntype0:feat0"
    • Argument: --node-feat-name "ntype0:feat0 ntype1:featA"

    • Default value: If not provided, GraphStorm uses no node features, even if the graph has node features attached.

  • num_layers: Number of GNN layers. Must be an integer larger than 0 if given. By default, it is set to 0, which means no GNN layers.
    • Yaml: num_layers: 2

    • Argument: --num-layers 2

    • Default value: 0

  • hidden_size: (Required) The dimension of hidden GNN layers. Must be an integer larger than 0. Currently, each GNN layer has the same hidden dimension.
    • Yaml: hidden_size: 128

    • Argument: --hidden-size 128

    • Default value: This parameter must be provided by user.

  • use_self_loop: Set to true to include a node’s own features as a special relation in relational GNN models. Used by the built-in RGCN and RGAT models.
    • Yaml: use_self_loop: false

    • Argument: --use-self-loop false

    • Default value: true
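The two node_feat_name formats described above can be told apart with a few lines of code. This is a minimal sketch with a hypothetical helper name, not GraphStorm's parser. It accepts either the command line form (one space-separated string) or the YAML form (a list of strings), and it handles one feature name per node type only.

```python
def parse_node_feat_name(value):
    """Return a feature name (format a) or a dict of node type to name (format b)."""
    entries = value.split() if isinstance(value, str) else list(value)
    if len(entries) == 1 and ":" not in entries[0]:
        return entries[0]                  # format a): a single feature name
    feats = {}
    for entry in entries:
        ntype, _, feat = entry.partition(":")
        if not ntype or not feat:
            raise ValueError(f"expected ntype:feat, got {entry!r}")
        feats[ntype] = feat                # format b): per node type feature
    return feats

print(parse_node_feat_name("fname"))
print(parse_node_feat_name("ntype0:feat0 ntype1:featA"))
print(parse_node_feat_name(["ntype1:featA", "ntype0:feat0"]))
```

Because colons and spaces act as separators here, it is easy to see why they cannot appear inside a feature name.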

Built-in Model Specific Configurations

RGCN

  • num_bases: Number of filter weight matrices. num_bases is used to reduce the overall number of parameters of an RGCN model by letting the weight matrices of different relation types share parameters. Note: the number of relation types of the graph used in training must be divisible by num_bases. By default, num_bases is set to -1, which means the weight matrices do not share parameters.
    • Yaml: num_bases: 2

    • Argument: --num-bases 2

    • Default value: -1

RGAT

  • num_heads: Number of attention heads.
    • Yaml: num_heads: 8

    • Argument: --num-heads 8

    • Default value: 4

Model Save/Restore Configurations

GraphStorm provides a set of parameters to control how and where to save and restore models.

  • save_model_path: A path to save GraphStorm model parameters and the corresponding optimizer status. The saved model parameters can be used in inference or model fine-tuning. See restore_model_path for how to retrieve a saved model and restore_optimizer_path for how to retrieve optimizer status.
    • Yaml: save_model_path: /model/checkpoint/

    • Argument: --save-model-path /model/checkpoint/

    • Default value: If not provided, models will not be saved.

  • save_embed_path: A path to save generated node embeddings.
    • Yaml: save_embed_path: /model/emb/

    • Argument: --save-embed-path /model/emb/

    • Default value: If not provided, node embeddings will not be saved.

  • save_model_frequency: The number of iterations between model saves. By default, GraphStorm saves models at the end of each epoch if save_model_path is provided. A user can set a positive integer, e.g. N, to let GraphStorm save models every N iterations (mini-batches).
    • Yaml: save_model_frequency: 1000

    • Argument: --save-model-frequency 1000

    • Default value: -1. GraphStorm will not save models within an epoch.

  • topk_model_to_save: The number of best-performing GraphStorm models to keep. By default, GraphStorm keeps all saved models on disk, which can consume a large amount of disk space. Users can set a positive integer, e.g. K, to let GraphStorm keep only the K models with the best performance.
    • Yaml: topk_model_to_save: 3

    • Argument: --topk-model-to-save 3

    • Default value: 0. GraphStorm will keep all saved models on disk.

  • save_perf_results_path: Folder path to save performance results of model evaluation.
    • Yaml: save_perf_results_path: /model/results/

    • Argument: --save-perf-results-path /model/results/

    • Default value: None

  • task_tracker: A task tracker used to formalize and report model performance metrics. Currently, GraphStorm only supports sagemaker_task_tracker, which prints evaluation metrics in a formatted way so that a user can capture those metrics through SageMaker. See Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics for more details.
    • Yaml: task_tracker: sagemaker_task_tracker

    • Argument: --task-tracker sagemaker_task_tracker

    • Default value: sagemaker_task_tracker

  • log_report_frequency: The frequency of reporting model performance metrics through task_tracker. The frequency is defined by using number of iterations, i.e., every N iterations the evaluation metrics will be reported. (Please note the evaluation metrics should be generated at the reporting iteration. See “eval_frequency” for how evaluation frequency is controlled.)
    • Yaml: log_report_frequency: 1000

    • Argument: --log-report-frequency 1000

    • Default value: 1000

  • restore_model_path: A path where GraphStorm model parameters were saved. For training, if restore_model_path is set, GraphStorm will retrieve the model parameters from restore_model_path instead of initializing the parameters. For inference, restore_model_path must be provided.
    • Yaml: restore_model_path: /model/checkpoint/

    • Argument: --restore-model-path /model/checkpoint/

    • Default value: This parameter must be provided if users want to restore a saved model.

  • restore_optimizer_path: A path storing the optimizer status corresponding to the GraphStorm model parameters. This is used when a user wants to fine-tune a model from a pre-trained one.
    • Yaml: restore_optimizer_path: /model/checkpoint/optimizer

    • Argument: --restore-optimizer-path /model/checkpoint/optimizer

    • Default value: This parameter must be provided if users want to restore a saved optimizer.
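The effect of topk_model_to_save can be pictured with a small bookkeeping sketch. It is only an illustration under stated assumptions, not GraphStorm's implementation: a higher validation score is taken to be better, the helper update_topk is hypothetical, and so are the checkpoint paths.

```python
def update_topk(saved, path, score, k):
    """Record a new checkpoint and return (kept, dropped).

    saved is a list of (score, path) pairs; k <= 0 keeps every checkpoint.
    """
    candidates = sorted(saved + [(score, path)], reverse=True)
    if k <= 0:
        return candidates, []
    return candidates[:k], candidates[k:]

saved = []
dropped = []
# Validation scores observed at four hypothetical save points.
for iteration, score in [(1000, 0.71), (2000, 0.75), (3000, 0.69), (4000, 0.78)]:
    saved, dropped = update_topk(
        saved, f"/model/checkpoint/iter-{iteration}", score, k=3)

print([path for _, path in saved])
print(dropped)
```

With k=3, the checkpoint with the lowest score is the one that would be removed from disk after the fourth save.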

Model Training Hyper-parameters Configurations

GraphStorm provides a set of parameters to control training hyper-parameters.

  • fanout: The fanout of each GNN layer. The fanouts must be integers larger than 0, and the number of fanouts must equal num_layers. It accepts two formats: a) “20,10”, which defines the number of neighbors to sample per edge type for each GNN layer, with the i-th element being the fanout of the i-th GNN layer. In the example, the fanout of the 0th GNN layer is 20 and the fanout of the 1st GNN layer is 10. b) “etype2:20@etype3:20@etype1:10,etype2:10@etype3:4@etype1:2”, which defines the number of neighbors to sample for each edge type in each GNN layer, with the i-th element being the fanouts of the i-th GNN layer. In the example, the fanouts of etype2, etype3 and etype1 are 20, 20 and 10 respectively in the 0th GNN layer, and 10, 4 and 2 respectively in the 1st GNN layer.
    • Yaml: fanout: 10,10

    • Argument: --fanout 10,10

    • Default value: This parameter must be provided by user. However, if num_layers is set to 0, which means there is no GNN layer, there is no need to specify this configuration.

  • dropout: Dropout probability. Dropout must be a float value in [0,1). Dropout is applied to every GNN layer.
    • Yaml: dropout: 0.5

    • Argument: --dropout 0.5

    • Default value: 0.0

  • lr: (Required) Learning rate. Learning rate for dense parameters of input encoder, model encoder and decoder.
    • Yaml: lr: 0.5

    • Argument: --lr 0.5

    • Default value: This parameter must be provided by user.

  • num_epochs: Number of training epochs. Must be integer.
    • Yaml: num_epochs: 5

    • Argument: --num-epochs 5

    • Default value: 0. By default only do testing/inference.

  • batch_size: (Required) Mini-batch size. It defines the batch size of each trainer. The global batch size equals the number of trainers multiplied by batch_size. For example, suppose we have 2 machines, each with 8 GPUs and one trainer per GPU, and set batch_size to 128. The global batch size will be 2 * 8 * 128 = 2048.
    • Yaml: batch_size: 128

    • Argument: --batch-size 128

    • Default value: This parameter must be provided by user.

  • sparse_optimizer_lr: Learning rate of sparse optimizer. Learning rate for the optimizer corresponding to learnable sparse embeddings.
    • Yaml: sparse_optimizer_lr: 0.5

    • Argument: --sparse-optimizer-lr 0.5

    • Default value: Same as lr.

  • use_node_embeddings: Set to true to use an extra learnable node embedding for each node.
    • Yaml: use_node_embeddings: true

    • Argument: --use-node-embeddings true

    • Default value: false

  • wd_l2norm: Weight decay used by torch.optim.Adam.
    • Yaml: wd_l2norm: 0.1

    • Argument: --wd-l2norm 0.1

    • Default value: 0

  • alpha_l2norm: Coefficient of the L2 norm of dense parameters. GraphStorm adds a regularization loss, i.e., the L2 norm of dense parameters, to the final loss. It uses alpha_l2norm to re-scale the regularization loss. Specifically, loss = loss + alpha_l2norm * regularization_loss.
    • Yaml: alpha_l2norm: 0.00001

    • Argument: --alpha-l2norm 0.00001

    • Default value: 0.0
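Two of the rules in this section lend themselves to a short sketch: the two fanout string formats and the global batch size arithmetic. The helper names parse_fanout, check_fanout and global_batch_size are hypothetical, and the parser only mirrors the formats as written above; it is not GraphStorm's own code.

```python
def parse_fanout(value):
    """Return one entry per GNN layer: an int or a dict of edge type to int."""
    layers = []
    for layer in value.split(","):
        if ":" in layer:
            # Format b): per edge type fanouts separated by "@".
            per_etype = {}
            for item in layer.split("@"):
                etype, _, num = item.partition(":")
                per_etype[etype] = int(num)
            layers.append(per_etype)
        else:
            # Format a): one fanout shared by all edge types.
            layers.append(int(layer))
    return layers

def check_fanout(layers, num_layers):
    """Check that there is one positive fanout per GNN layer."""
    if len(layers) != num_layers:
        raise ValueError("number of fanouts must equal num_layers")
    for layer in layers:
        values = layer.values() if isinstance(layer, dict) else [layer]
        if any(v <= 0 for v in values):
            raise ValueError("fanouts must be larger than 0")

def global_batch_size(num_machines, trainers_per_machine, batch_size):
    """Number of trainers multiplied by the per-trainer batch size."""
    return num_machines * trainers_per_machine * batch_size

simple = parse_fanout("20,10")
detailed = parse_fanout("etype2:20@etype3:20@etype1:10,etype2:10@etype3:4@etype1:2")
check_fanout(simple, num_layers=2)
check_fanout(detailed, num_layers=2)
print(simple, detailed, global_batch_size(2, 8, 128))
```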

Early stop configurations

GraphStorm provides a set of parameters to control early stop of training. By default, GraphStorm finishes training after num_epochs. One can use early stop to exit model training earlier.

Every time evaluation is triggered, GraphStorm checks the early stop criteria. During the first early_stop_burnin_rounds evaluation calls, GraphStorm does not use early stop. After that, GraphStorm decides whether to stop early based on early_stop_strategy. There are two strategies: 1) consecutive_increase, where early stop is triggered if the validation scores have been decreasing for the last early_stop_rounds consecutive steps, and 2) average_increase, where early stop is triggered if the current validation score is lower than the average of the last early_stop_rounds validation scores.
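The two stopping criteria described above can be sketched as follows. This is an interpretation of the text, not GraphStorm's code: the function names are hypothetical and deliberately neutral, a higher validation score is assumed to be better, and "decreasing" is read as each score being lower than the one before it.

```python
from statistics import mean

def below_recent_average(current, recent):
    """True if the current score is lower than the average of recent scores."""
    return current < mean(recent)

def consecutively_decreasing(current, recent):
    """True if the scores fell at every one of the most recent steps."""
    scores = list(recent) + [current]
    return all(later < earlier for earlier, later in zip(scores, scores[1:]))

def should_stop(history, current, burnin_rounds, rounds, criterion):
    """Decide whether to stop at the current evaluation call.

    history holds the scores of earlier evaluation calls, oldest first.
    """
    if len(history) < burnin_rounds:
        return False          # still inside the burn-in period
    if len(history) < rounds:
        return False          # not enough scores to compare against
    return criterion(current, history[-rounds:])

history = [0.70, 0.74, 0.73, 0.72]
print(should_stop(history, 0.71, 0, 3, below_recent_average))
print(should_stop(history, 0.71, 0, 3, consecutively_decreasing))
print(should_stop(history, 0.71, 10, 3, below_recent_average))
```

The last call returns False only because of the burn-in setting, which shows how early_stop_burnin_rounds delays both criteria.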

  • early_stop_burnin_rounds: The number of initial evaluation calls (the burn-in period) before GraphStorm starts considering early stop.
    • Yaml: early_stop_burnin_rounds: 100

    • Argument: --early-stop-burnin-rounds 100

    • Default value: 0.0

  • early_stop_rounds: The number of most recent validation scores used to decide whether to stop early.
    • Yaml: early_stop_rounds: 5

    • Argument: --early-stop-rounds 5

    • Default value: 3.

  • early_stop_strategy: GraphStorm supports two strategies: 1) consecutive_increase and 2) average_increase.
    • Yaml: early_stop_strategy: consecutive_increase

    • Argument: --early-stop-strategy average_increase

    • Default value: average_increase

  • use_early_stop: Set to true to enable early stop.
    • Yaml: use_early_stop: true

    • Argument: --use-early-stop true

    • Default value: false

Model Evaluation Configurations

GraphStorm provides a set of parameters to control model evaluation.

  • eval_batch_size: Mini-batch size for computing GNN embeddings in evaluation. You can set eval_batch_size larger than batch_size to speed up GNN embedding computation. Note that a larger eval_batch_size consumes more GPU memory.
    • Yaml: eval_batch_size: 1024

    • Argument: --eval-batch-size 1024

    • Default value: 10000.

  • eval_fanout: (Required) The fanout of each GNN layer used in evaluation and inference. It follows the same format as fanout.
    • Yaml: eval_fanout: "10,10"

    • Argument: --eval-fanout 10,10

    • Default value: This parameter must be provided by user. However, if num_layers is set to 0, which means there is no GNN layer, there is no need to specify this configuration.

  • use_mini_batch_infer: Set to true to do mini-batch inference during evaluation and inference. Set to false to do full-graph inference during evaluation and inference. For node classification/regression and edge classification/regression tasks, if the evaluation set or testing set is small, mini-batch inference can be more efficient as it does not waste resources computing node embeddings for nodes not used during inference. However, if the test set is large or the task is link prediction, full-graph inference (use_mini_batch_infer set to false) is preferred, as it avoids recomputing node embeddings during inference.
    • Yaml: use_mini_batch_infer: false

    • Argument: --use-mini-batch-infer false

    • Default value: true

  • eval_frequency: The frequency of doing evaluation. GraphStorm trainers do evaluation at the end of each epoch. However, for large-scale graphs, training one epoch may take hundreds of thousands of iterations. One may want to do evaluations in the middle of an epoch. When eval_frequency is set, every eval_frequency iterations, the trainer will do evaluation once. The evaluation results can be printed and reported. See log_report_frequency for more details.
    • Yaml: eval_frequency: 10000

    • Argument: --eval-frequency 10000

    • Default value: sys.maxsize. The system will not do evaluation in the middle of an epoch.

  • no_validation: Set to true to skip model evaluation (validation) during training.
    • Yaml: no_validation: true

    • Argument: --no-validation true

    • Default value: false
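To make the interaction of eval_frequency and no_validation concrete, the sketch below lists the iterations after which evaluation would run. It rests on assumptions that the text above does not settle: iterations are counted globally across epochs starting at 1, and the end-of-epoch evaluation always happens unless validation is disabled. The function evaluation_points is hypothetical.

```python
import sys

def evaluation_points(num_epochs, iters_per_epoch,
                      eval_frequency=sys.maxsize, no_validation=False):
    """Return the global iteration numbers after which evaluation runs."""
    if no_validation:
        return []
    points = []
    for iteration in range(1, num_epochs * iters_per_epoch + 1):
        at_epoch_end = iteration % iters_per_epoch == 0
        at_frequency = iteration % eval_frequency == 0
        if at_epoch_end or at_frequency:
            points.append(iteration)
    return points

print(evaluation_points(2, 5))                      # default eval_frequency
print(evaluation_points(2, 5, eval_frequency=2))    # extra mid-epoch evaluations
print(evaluation_points(2, 5, no_validation=True))  # validation disabled
```

Since log_report_frequency only reports metrics that exist at the reporting iteration, it is usually convenient to pick a value that lines up with these evaluation points.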

Language Model Specific Configurations

GraphStorm supports co-training language models with GNN. GraphStorm provides a set of parameters to control language model fine-tuning.

  • lm_tune_lr: Learning rate for fine-tuning language model.
    • Yaml: lm_tune_lr: 0.0001

    • Argument: --lm-tune-lr 0.0001

    • Default value: same as lr

  • lm_train_nodes: Number of nodes used to fine-tune each language model.
    • Yaml: lm_train_nodes: 10

    • Argument: --lm-train-nodes 10

    • Default value: 0

  • lm_infer_batch_size: Batch size used in LM model inference.
    • Yaml: lm_infer_batch_size: 10

    • Argument: --lm-infer-batch-size 10

    • Default value: 32

  • freeze_lm_encoder_epochs: The number of epochs used to warm up the GNN model before fine-tuning of the language model starts.
    • Yaml: freeze_lm_encoder_epochs: 1

    • Argument: --freeze-lm-encoder-epochs 1

    • Default value: 0

Task Specific Configurations

GraphStorm supports node classification, node regression, edge classification, edge regression and link prediction tasks. It provides rich task related configurations.

General Configurations

  • task_type: (Required) Supported task types are node_classification, node_regression, edge_classification, edge_regression, and link_prediction.
    • Yaml: task_type: node_classification

    • Argument: --task-type node_classification

    • Default value: This parameter must be provided by user.

  • eval_metric: Evaluation metric used during evaluation. The input can be a string specifying the evaluation metric to report or a list of strings specifying a list of evaluation metrics to report. The first evaluation metric is treated as the major metric and is used to choose the best trained model. The supported evaluation metrics of classification tasks include accuracy, precision_recall, roc_auc, f1_score, per_class_f1_score. The supported evaluation metrics of regression tasks include rmse and mse. The supported evaluation metrics of link prediction tasks include mrr.
    • Yaml: eval_metric:
      - accuracy
      - precision_recall
    • Argument: --eval-metric accuracy precision_recall

    • Default value:
      • For classification tasks, the default value is accuracy.

      • For regression tasks, the default value is rmse.

      • For link prediction tasks, the default value is mrr.
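The eval_metric rules above (a string or a list, per-task defaults, and the first metric acting as the major metric) can be summarized in a short sketch. The helper resolve_eval_metrics and the table layout are hypothetical; the metric names and defaults are the ones listed in this section.

```python
SUPPORTED_METRICS = {
    # The first entry of each list is the documented default.
    "classification": ["accuracy", "precision_recall", "roc_auc",
                       "f1_score", "per_class_f1_score"],
    "regression": ["rmse", "mse"],
    "link_prediction": ["mrr"],
}

def resolve_eval_metrics(task_type, eval_metric=None):
    """Return (major_metric, all_metrics) for a task type."""
    if task_type == "link_prediction":
        family = task_type
    else:
        # node_classification -> classification, edge_regression -> regression
        family = task_type.split("_", 1)[1]
    supported = SUPPORTED_METRICS[family]
    if eval_metric is None:
        metrics = [supported[0]]
    elif isinstance(eval_metric, str):
        metrics = [eval_metric]
    else:
        metrics = list(eval_metric)
    unknown = [m for m in metrics if m not in supported]
    if unknown:
        raise ValueError(f"unsupported metrics for {task_type}: {unknown}")
    return metrics[0], metrics

print(resolve_eval_metrics("node_classification", ["accuracy", "precision_recall"]))
print(resolve_eval_metrics("edge_regression"))
print(resolve_eval_metrics("link_prediction"))
```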

Classification and Regression Task

  • label_field: (Required) The field name of labelled data in the graph data. For node classification tasks, GraphStorm uses graph.nodes[target_ntype].data[label_field] to access node labels. For edge classification tasks, GraphStorm uses graph.edges[target_etype].data[label_field] to access edge labels.
    • Yaml: label_field: color

    • Argument: --label-field color

    • Default value: This parameter must be provided by user.

  • num_classes: (Required) The cardinality of labels in a classification task. Used by node classification and edge classification.
    • Yaml: num_classes: 10

    • Argument: --num-classes 10

    • Default value: This parameter must be provided by user.

  • multilabel: If set to true, the task is a multi-label classification task. Used by node classification and edge classification.
    • Yaml: multilabel: true

    • Argument: --multilabel true

    • Default value: false

  • multilabel_weights: Used to specify the label weight of each class in a multi-label classification task. This is used together with multilabel. It is fed into torch.nn.BCEWithLogitsLoss. The weights should be in the following format: 0.1,0.2,0.3,0.1,0.0. Each field represents the weight of a class. Suppose there are 3 classes and multilabel_weights is set to 0.1,0.2,0.3. Class 0 will have a weight of 0.1, class 1 a weight of 0.2 and class 2 a weight of 0.3. For more details, see BCEWithLogitsLoss. If not provided, all classes are treated equally.
    • Yaml: multilabel_weights: 0.1,0.2,0.3

    • Argument: --multilabel-weights 0.1,0.2,0.3

    • Default value: None

  • imbalance_class_weights: Used to specify a manual rescaling weight given to each class in a single-label multi-class classification task. It is used in imbalanced label use cases. It is fed into torch.nn.CrossEntropyLoss. Each field represents the weight of a class. Suppose there are 3 classes and imbalance_class_weights is set to 0.1,0.2,0.3. Class 0 will have a weight of 0.1, class 1 a weight of 0.2 and class 2 a weight of 0.3. If not provided, all classes are treated equally.
    • Yaml: imbalance_class_weights: 0.1,0.2,0.3

    • Argument: --imbalance-class-weights 0.1,0.2,0.3

    • Default value: None

  • save_prediction_path: Path to save prediction results. This is used in node/edge classification/regression inference.
    • Yaml: save_prediction_path: /data/infer-output/predictions/

    • Argument: --save-prediction-path /data/infer-output/predictions/

    • Default value: If not provided, it will be the same as save_embed_path.
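The weight strings used by multilabel_weights and imbalance_class_weights follow the same comma-separated layout, which the sketch below turns into a list of floats indexed by class. The helper parse_class_weights is hypothetical, and requiring exactly one weight per class is an assumption made here for the check. Converting the list into a tensor for the PyTorch loss is left out.

```python
def parse_class_weights(value, num_classes):
    """Parse a string such as "0.1,0.2,0.3" into one weight per class."""
    weights = [float(field) for field in value.split(",")]
    if len(weights) != num_classes:
        raise ValueError(
            f"expected {num_classes} weights, got {len(weights)}")
    return weights

weights = parse_class_weights("0.1,0.2,0.3", num_classes=3)
for class_id, weight in enumerate(weights):
    print(f"class {class_id}: weight {weight}")
```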

Node Classification/Regression Specific

  • target_ntype: (Required) The node type for prediction.
    • Yaml: target_ntype: movie

    • Argument: --target-ntype movie

    • Default value: This parameter must be provided by user.

Edge Classification/Regression Specific

  • target_etype: (Required) The list of canonical edge types used as training targets in edge classification/regression tasks, for example --target-etype query,clicks,asin or --target-etype query,clicks,asin query,search,asin. A canonical edge type should be formatted as src_node_type,relation_type,dst_node_type. Currently, GraphStorm only supports single-task edge classification/regression, i.e., it only accepts one canonical edge type.
    • Yaml: target_etype:
      - query,clicks,asin
    • Argument: --target-etype query,clicks,asin

    • Default value: This parameter must be provided by user.

  • remove_target_edge_type: When set to true, GraphStorm removes target_etype in message passing, i.e., any edge with target_etype will not be sampled during training and inference.
    • Yaml: remove_target_edge_type: false

    • Argument: --remove-target-edge-type false

    • Default value: true

  • reverse_edge_types_map: A list of reverse edge type info. Each entry is in the following format: head,relation,reverse_relation,tail. For example: [“query,adds,rev-adds,asin”, “query,clicks,rev-clicks,asin”]. For edge classification/regression tasks, if remove_target_edge_type is set to true and reverse_edge_types_map is provided, GraphStorm will remove both target_etype and the corresponding reverse edge type(s) in message passing. In this case, any edge with target_etype or the reverse of target_etype will not be sampled during training and inference. For link prediction tasks, if exclude_training_targets is set to true and reverse_edge_types_map is provided, GraphStorm will remove both the target edges with train_etype and the corresponding reverse edges with the reverse edge types of train_etype in message passing. In contrast to edge classification/regression tasks, for link prediction tasks GraphStorm only excludes the specific target edges instead of all edges with the target edge type or its reverse edge type.
    • Yaml: reverse_edge_types_map:
      - query,adds,rev-adds,asin
      - query,clicks,rev-clicks,asin
    • Argument: --reverse-edge-types-map query,adds,rev-adds,asin query,clicks,rev-clicks,asin

    • Default value: None

  • decoder_type: Type of edge classification or regression decoder. Built-in decoders include DenseBiDecoder and MLPDecoder. DenseBiDecoder implements the bi-linear decoder used in GCMC. MLPDecoder simply applies multilayer perceptron layers for prediction.
    • Yaml: decoder_type: DenseBiDecoder

    • Argument: --decoder-type MLPDecoder

    • Default value: DenseBiDecoder

  • num_decoder_basis: The number of bases for DenseBiDecoder in edge prediction tasks.
    • Yaml: num_decoder_basis: 2

    • Argument: --num-decoder-basis 2

    • Default value: 2
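The string formats used by target_etype and reverse_edge_types_map can be unpacked as in the sketch below. All helper names are hypothetical, and treating the reverse edge type as running from tail to head is an assumption, not something stated above. The last helper lists the edge types that would be left out of message passing when remove_target_edge_type is true.

```python
def parse_canonical_etype(value):
    """Split "src_node_type,relation_type,dst_node_type" into a tuple."""
    parts = tuple(value.split(","))
    if len(parts) != 3:
        raise ValueError(f"expected src,relation,dst but got {value!r}")
    return parts

def parse_reverse_edge_types_map(entries):
    """Map each canonical edge type to its reverse edge type."""
    mapping = {}
    for entry in entries:
        head, relation, reverse_relation, tail = entry.split(",")
        # Assumption: the reverse edge points from tail back to head.
        mapping[(head, relation, tail)] = (tail, reverse_relation, head)
    return mapping

def etypes_to_exclude(target_etype, reverse_map):
    """Return the target edge type plus its reverse type if one is known."""
    target = parse_canonical_etype(target_etype)
    excluded = [target]
    if target in reverse_map:
        excluded.append(reverse_map[target])
    return excluded

reverse_map = parse_reverse_edge_types_map(
    ["query,adds,rev-adds,asin", "query,clicks,rev-clicks,asin"])
print(etypes_to_exclude("query,clicks,asin", reverse_map))
```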

Link Prediction Task

  • train_etype: The list of canonical edge types used as training targets. If not provided, all edge types will be used as training targets. A canonical edge type should be formatted as src_node_type,relation_type,dst_node_type.
    • Yaml: train_etype:
      - query,clicks,asin
      - query,adds,asin
    • Argument: --train-etype query,clicks,asin query,adds,asin

    • Default value: None

  • eval_etype: The list of canonical edge types used as evaluation targets. If not provided, all edge types will be used as evaluation targets. In some link prediction use cases, users want to train a model using all edges of a graph but only do link prediction on specific edge type(s) for downstream applications. In such cases, they only care about the model performance on those edge types.
    • Yaml: eval_etype:
      - query,clicks,asin
      - query,adds,asin
    • Argument: --eval-etype query,clicks,asin query,adds,asin

    • Default value: None

  • exclude_training_targets: If it is set to true, GraphStorm removes the training targets from the GNN computation graph. If true, reverse_edge_types_map MUST be provided.
    • Yaml: exclude_training_targets: false

    • Argument: --exclude-training-targets false

    • Default value: true

  • train_negative_sampler: The negative sampler used for link prediction training. Built-in samplers include uniform, joint, localuniform, all_etype_uniform and all_etype_joint.
    • Yaml: train_negative_sampler: uniform

    • Argument: --train-negative-sampler joint

    • Default value: uniform

  • eval_negative_sampler: The negative sampler used for link prediction testing and evaluation. Built-in samplers include uniform, joint, localuniform, all_etype_uniform and all_etype_joint.
    • Yaml: eval_negative_sampler: uniform

    • Argument: --eval-negative-sampler joint

    • Default value: joint

  • num_negative_edges: Number of negative edges sampled for each positive edge during training.
    • Yaml: num_negative_edges: 32

    • Argument: --num-negative-edges 32

    • Default value: 16

  • num_negative_edges_eval: Number of negative edges sampled for each positive edge in the validation and test set.
    • Yaml: num_negative_edges_eval: 1000

    • Argument: --num-negative-edges-eval 1000

    • Default value: 1000

  • lp_decoder_type: Set the decoder type for the loss function in link prediction tasks. Currently, GraphStorm supports dot_product and DistMult.
    • Yaml: lp_decoder_type: dot_product

    • Argument: --lp-decoder-type dot_product

    • Default value: dot_product

  • gamma: Gamma for DistMult. The margin value in the score function.
    • Yaml: gamma: 10.0

    • Argument: --gamma 10.0

    • Default value: 12.0

  • lp_loss_func: Link prediction loss function. Built-in loss functions include cross_entropy and logsigmoid.
    • Yaml: lp_loss_func: cross_entropy

    • Argument: --lp-loss-func logsigmoid

    • Default value: cross_entropy
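For readers unfamiliar with the two decoder types named in lp_decoder_type, the sketch below shows the textbook scoring functions on plain Python lists. It is not GraphStorm's implementation: the function names and embedding values are made up for illustration, and gamma is not modeled.

```python
def dot_product_score(src, dst):
    """Score an edge as the dot product of its two node embeddings."""
    return sum(s * d for s, d in zip(src, dst))

def distmult_score(src, rel, dst):
    """Score an edge with DistMult: a per-relation weight on each dimension."""
    return sum(s * r * d for s, r, d in zip(src, rel, dst))

src = [1.0, 2.0]      # embedding of the source node (illustrative values)
dst = [3.0, 0.5]      # embedding of the destination node
rel = [0.5, 2.0]      # embedding of the relation type, used by DistMult only

print(dot_product_score(src, dst))
print(distmult_score(src, rel, dst))
```

The dot product treats all edge types alike, while DistMult learns one relation embedding per edge type, which is why it can score the same node pair differently for different relations.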