This is a getting started guide to XGBoost4J-Spark using a Jupyter notebook. By the end of this guide, the reader will be able to run a sample notebook on NVIDIA GPUs.
Before you begin, ensure that you have set up a Spark Standalone Cluster. It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined, pointing to the Spark master URL (e.g. `spark://localhost:7077`) and the Apache Spark home directory, respectively.
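For example, the two variables might be exported as follows (the URL and path here are placeholders; substitute the values for your cluster):

```shell
# Placeholder values -- adjust the master URL and install path to your setup.
export SPARK_MASTER=spark://localhost:7077
export SPARK_HOME=/opt/spark
```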
- Make sure you have Jupyter notebook installed.
  If you install it with conda, make sure the Python version is consistent with the one PySpark uses.
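After installing Jupyter (e.g. `conda install -c conda-forge notebook`), one way to check which interpreter the notebook environment will use is to print it directly (the `python3` command name is an assumption; adjust to your environment's interpreter):

```shell
# Print the interpreter path and version of the active environment; this
# should match the Python that the PySpark driver will launch.
python3 -c 'import sys; print(sys.executable, sys.version_info[:3])'
```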
- Prepare packages and dataset.
  Make sure you have prepared the necessary packages and dataset by following this guide.
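The launch command in the next step refers to several jar-path variables. They could be exported along these lines (the directory and file names below are illustrative placeholders, not the actual artifact names; point them at the jars and sample zip you prepared):

```shell
# Placeholder paths -- substitute the files you actually downloaded/prepared.
export JARS_DIR=${HOME}/jars
export CUDF_JAR=${JARS_DIR}/cudf.jar
export RAPIDS_JAR=${JARS_DIR}/rapids-4-spark.jar
export XGBOOST4J_JAR=${JARS_DIR}/xgboost4j.jar
export XGBOOST4J_SPARK_JAR=${JARS_DIR}/xgboost4j-spark.jar
export SAMPLE_ZIP=${JARS_DIR}/samples.zip
```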
- Launch the notebook:

  ```shell
  PYSPARK_DRIVER_PYTHON=jupyter \
  PYSPARK_DRIVER_PYTHON_OPTS=notebook \
  pyspark \
    --master ${SPARK_MASTER} \
    --conf spark.executor.extraClassPath=${CUDF_JAR}:${RAPIDS_JAR} \
    --jars ${CUDF_JAR},${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
    --py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.memory.gpu.pooling.enabled=false \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
  ```
This starts Jupyter; open `mortgage-gpu.ipynb` in it to explore.