Scala API for XGBoost-Spark

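This doc focuses on the GPU-related Scala API interfaces. It covers six classes: GpuDataset, GpuDataReader, XGBoostClassifier, XGBoostClassificationModel, XGBoostRegressor and XGBoostRegressionModel.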

GpuDataset

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset. A GpuDataset is an object that is produced by GpuDataReaders and consumed by XGBoostClassifiers and XGBoostRegressors. No constructors or methods are exposed for this class.

GpuDataReader

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader. A GpuDataReader sets options and builds a GpuDataset from data sources. Data loading is lazy: it happens only when the data is actually processed later.

Constructors
  • GpuDataReader(sparkSession: SparkSession)
    • sparkSession: spark session for data loading
Methods
  • format(source: String): GpuDataReader. This method sets data format. Valid values include csv, parquet and orc.
    • source: data format to set
    • returns the data reader itself
  • schema(schema: StructType): GpuDataReader. This method sets data schema.
    • schema: data schema in StructType format
    • returns the data reader itself
  • schema(schemaString: String): GpuDataReader. This method sets data schema.
    • schemaString: data schema in DDL-formatted String, e.g., a INT, b STRING, c DOUBLE
    • returns the data reader itself
  • option(key: String, value: String): GpuDataReader. This method sets an option.
    • key: the option key
    • value: the option value in string format
    • returns the data reader itself
  • option(key: String, value: Boolean): GpuDataReader. This method sets an option.
    • key: the option key
    • value: the Boolean option value
    • returns the data reader itself
  • option(key: String, value: Long): GpuDataReader. This method sets an option.
    • key: the option key
    • value: the Long option value
    • returns the data reader itself
  • option(key: String, value: Double): GpuDataReader. This method sets an option.
    • key: the option key
    • value: the Double option value
    • returns the data reader itself
  • options(options: scala.collection.Map[String, String]): GpuDataReader. This method sets options.
    • options: the options Map to set
    • returns the data reader itself
  • options(options: java.util.Map[String, String]): GpuDataReader. This method sets options. It is designed for Java compatibility.
    • options: the options Map to set
    • returns the data reader itself
  • load(): GpuDataset. This method builds a GpuDataset.
    • returns a GpuDataset as the result
  • load(path: String): GpuDataset. This method builds a GpuDataset.
    • path: the data source path
    • returns a GpuDataset as the result
  • load(paths: String*): GpuDataset. This method builds a GpuDataset.
    • paths: the data source paths
    • returns a GpuDataset as the result
  • csv(path: String): GpuDataset. This method builds a GpuDataset.
    • path: the CSV data path
    • returns a GpuDataset as the result
  • csv(paths: String*): GpuDataset. This method builds a GpuDataset.
    • paths: the CSV data paths
    • returns a GpuDataset as the result
  • parquet(path: String): GpuDataset. This method builds a GpuDataset.
    • path: the Parquet data path
    • returns a GpuDataset as the result
  • parquet(paths: String*): GpuDataset. This method builds a GpuDataset.
    • paths: the Parquet data paths
    • returns a GpuDataset as the result
  • orc(path: String): GpuDataset. This method builds a GpuDataset.
    • path: the ORC data path
    • returns a GpuDataset as the result
  • orc(paths: String*): GpuDataset. This method builds a GpuDataset.
    • paths: the ORC data paths
    • returns a GpuDataset as the result
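
Because every setter returns the reader itself, the calls can be chained. The following is a minimal sketch of that pattern using only the methods listed above; the object name, the paths and the maxRowsPerChunk value are hypothetical placeholders. No data is read when loadTrain returns, since loading is lazy.

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}

object ParquetLoadSketch {
  // Hypothetical input locations; replace with real paths.
  val trainPaths: Seq[String] = Seq(
    "/data/train/part-0.parquet",
    "/data/train/part-1.parquet"
  )

  // Builds a GpuDataset description; nothing is loaded until it is consumed.
  def loadTrain(spark: SparkSession): GpuDataset =
    new GpuDataReader(spark)
      .format("parquet")                   // one of csv, parquet, orc
      .option("maxRowsPerChunk", 100000L)  // Long overload; value is arbitrary
      .load(trainPaths: _*)                // varargs overload of load
}
```

The varargs overload is used here so that several files end up in one GpuDataset; a single path can be passed to load(path) in the same way.
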
Options
  • Common options
    • asFloats: A Boolean flag indicating whether to cast all numeric values to floats. Default is true.
    • maxRowsPerChunk: An Int specifying the maximum number of rows per chunk. Default is Int.MaxValue.
  • Options for CSV
    • comment: A single character; lines beginning with this character are skipped. Default is an empty string, which disables comment skipping.
    • header: A Boolean flag indicating whether the first line should be used as the column names. Default is false.
    • nullValue: The string representation of a null value. Default is an empty string.
    • quote: A single character used for escaping quoted values where the separator can be part of the value. Default is ".
    • sep: A single character as a separator between adjacent values. Default is ,.
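
The options above might be combined as in the sketch below, which reads a CSV file with a header row. The column names in the DDL schema, the "NA" null marker and the object name are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}

object CsvLoadSketch {
  // Hypothetical DDL-formatted schema: three feature columns and a label.
  val schemaDdl: String = "f0 DOUBLE, f1 DOUBLE, f2 DOUBLE, label DOUBLE"

  // CSV options from the list above, as strings for options(Map).
  val csvOptions: Map[String, String] = Map(
    "header" -> "true",    // first line holds the column names
    "sep" -> ",",          // same as the default, shown for clarity
    "nullValue" -> "NA"    // assumed null marker in the hypothetical file
  )

  def loadCsv(spark: SparkSession, path: String): GpuDataset =
    new GpuDataReader(spark)
      .schema(schemaDdl)         // String overload of schema
      .options(csvOptions)       // scala.collection.Map overload
      .option("asFloats", true)  // Boolean overload; true is also the default
      .csv(path)
}
```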

XGBoostClassifier

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier. It extends ProbabilisticClassifier[Vector, XGBoostClassifier, XGBoostClassificationModel].

Constructors
  • XGBoostClassifier(xgboostParams: Map[String, Any])
Methods

Note: Only GPU-related methods are listed below.

  • setFeaturesCols(value: Seq[String]): XGBoostClassifier. This method sets the feature columns for training.
    • value: a sequence of feature column names to set
    • returns the classifier itself
  • setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostClassifier. This method sets eval sets for training.
    • evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
    • returns the classifier itself
  • fit(dataset: GpuDataset): XGBoostClassificationModel. This method triggers the training.
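
A minimal training sketch with an eval set could look like the following. The parameter names in the map (objective, tree_method, num_round, num_workers) come from general XGBoost4J-Spark usage rather than from this document, and the column names are hypothetical. setLabelCol is inherited from Spark ML's Predictor.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset

object ClassifierSketch {
  // Assumed XGBoost parameters; tune them for a real workload.
  val params: Map[String, Any] = Map(
    "objective" -> "binary:logistic",
    "tree_method" -> "gpu_hist",
    "num_round" -> 100,
    "num_workers" -> 1
  )

  // Hypothetical feature column names present in the GpuDataset.
  val featureCols: Seq[String] = Seq("f0", "f1", "f2")

  def train(trainSet: GpuDataset, evalSet: GpuDataset): XGBoostClassificationModel =
    new XGBoostClassifier(params)
      .setLabelCol("label")                  // hypothetical label column
      .setFeaturesCols(featureCols)          // GPU path takes column names
      .setEvalSets(Map("eval" -> evalSet))   // GPU eval sets are GpuDatasets
      .fit(trainSet)                         // triggers loading and training
}
```

Note that the GPU path takes a sequence of feature column names through setFeaturesCols instead of a single assembled vector column.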

XGBoostClassificationModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel. It extends ProbabilisticClassificationModel[Vector, XGBoostClassificationModel].

Methods

Note: Only GPU-related methods are listed below.

XGBoostRegressor

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor. It extends Predictor[Vector, XGBoostRegressor, XGBoostRegressionModel].

Constructors
  • XGBoostRegressor(xgboostParams: Map[String, Any])
Methods

Note: Only GPU-related methods are listed below.

  • setFeaturesCols(value: Seq[String]): XGBoostRegressor. This method sets the feature columns for training.
    • value: a sequence of feature column names to set
    • returns the regressor itself
  • setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostRegressor. This method sets eval sets for training.
    • evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
    • returns the regressor itself
  • fit(dataset: GpuDataset): XGBoostRegressionModel. This method triggers the training.
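
One way to obtain the argument for setFeaturesCols is to derive it from the schema by dropping the label column. The sketch below illustrates this under assumed column names; the XGBoost parameter names are again taken from general XGBoost4J-Spark usage, not from this document.

```scala
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset

object RegressorSketch {
  val labelCol: String = "label"

  // Hypothetical schema, e.g. the one passed to GpuDataReader.schema.
  val schema: StructType = StructType(Seq(
    StructField("f0", DoubleType),
    StructField("f1", DoubleType),
    StructField(labelCol, DoubleType)
  ))

  // Every column except the label is treated as a feature.
  val featureCols: Seq[String] = schema.fieldNames.filter(_ != labelCol).toSeq

  // Assumed XGBoost parameters.
  val params: Map[String, Any] = Map(
    "tree_method" -> "gpu_hist",
    "num_round" -> 100,
    "num_workers" -> 1
  )

  def train(trainSet: GpuDataset): XGBoostRegressionModel =
    new XGBoostRegressor(params)
      .setLabelCol(labelCol)
      .setFeaturesCols(featureCols)
      .fit(trainSet)
}
```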

XGBoostRegressionModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel. It extends PredictionModel[Vector, XGBoostRegressionModel].

Methods

Note: Only GPU-related methods are listed below.