This is a getting started guide to XGBoost4J-Spark using a Jupyter notebook. By the end of this guide, the reader will be able to run a sample notebook on NVIDIA GPUs.
Before you begin, ensure that you have set up a Spark Standalone Cluster. It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined, pointing to the Spark master URL (e.g. `spark://localhost:7077`) and the Apache Spark home directory, respectively.
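For example, the two variables might be exported as follows (the URL and path here are placeholders; substitute the values for your cluster):

```shell
# Placeholder values -- adjust the master URL and install path to your setup.
export SPARK_MASTER=spark://localhost:7077
export SPARK_HOME=/opt/spark
```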
- Make sure you have Jupyter notebook installed.
  If you install it with conda, make sure the Python version is consistent with the one PySpark uses.
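After installing Jupyter (e.g. `conda install -c conda-forge notebook`), one way to check which interpreter the notebook environment will use is to print it directly (the `python3` command name is an assumption; adjust to your environment's interpreter):

```shell
# Print the interpreter path and version of the active environment; this
# should match the Python that the PySpark driver will launch.
python3 -c 'import sys; print(sys.executable, sys.version_info[:3])'
```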
- Prepare packages and dataset.
  Make sure you have prepared the necessary packages and dataset by following this guide.
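The launch command in the next step refers to several jar-path variables. They could be exported along these lines (the directory and file names below are illustrative placeholders, not the actual artifact names; point them at the jars and sample zip you prepared):

```shell
# Placeholder paths -- substitute the files you actually downloaded/prepared.
export JARS_DIR=${HOME}/jars
export CUDF_JAR=${JARS_DIR}/cudf.jar
export RAPIDS_JAR=${JARS_DIR}/rapids-4-spark.jar
export XGBOOST4J_JAR=${JARS_DIR}/xgboost4j.jar
export XGBOOST4J_SPARK_JAR=${JARS_DIR}/xgboost4j-spark.jar
export SAMPLE_ZIP=${JARS_DIR}/samples.zip
```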
- Launch the notebook:

  ```shell
  PYSPARK_DRIVER_PYTHON=jupyter \
  PYSPARK_DRIVER_PYTHON_OPTS=notebook \
  pyspark \
    --master ${SPARK_MASTER} \
    --conf spark.executor.extraClassPath=${CUDF_JAR}:${RAPIDS_JAR} \
    --jars ${CUDF_JAR},${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
    --py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.memory.gpu.pooling.enabled=false \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
  ```
This starts Jupyter; open `mortgage-gpu.ipynb` in it to explore.