Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu #10639

Merged
merged 17 commits into from
Sep 11, 2024

Conversation

wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Jul 25, 2024

About this PR

Features and Fixes

  • Introduce an abstract XGBoost Estimator:

    • Create an abstract base class that encapsulates common functionality for all XGBoost estimators.
    • This will promote code reuse and maintain consistency across different estimators.
  • Update to the latest XGBoost parameters

    • Add all XGBoost parameters supported in XGBoost4j-spark base on this
    • Add setter and getter for these parameters.
    • Remove the deprecated parameters
  • Address the missing value handling:

    • 0 is a meaningful value in machine learning case, the old version enforces 0 as the missing value which may result in XGBoost model inaccurate.
  • Remove any ETL operations in XGBoost

    • As a thin wrapper of XGBoost4j, it shouldn't wrap much ETL in XGBoost4j itself to make code hard to maintain and read.
  • Rework the GPU plugin:

    • Rework the GPU plugin to improve its stability and make it more readable and maintainable according to the new design pattern of xgboost4j-spark
  • Expand sanity tests for CPU and GPU consistency:

    • Increase the coverage of sanity tests to verify that the training and transform results are consistent between CPU and GPU implementations.
    • This will help identify and prevent any discrepancies or bugs related to the GPU support.

Breaking changes

  • Remove some unused parameters like Rabit related parameters and train_test_ratio. etc.

  • Separate XGBoost-spark parameter from the whole XGBoost parameters, and make the constructor of XGBoost estimator only accept the XGBoost parameters instead of parameters for xgboost4j-spark. Eg,

    val xgbParams: Map[String, Any] = Map(
          "max_depth" -> 5,
          "eta" -> 0.2,
          "objective" -> "binary:logistic"
        )
    val est = new XGBoostClassifier(xgbParams)
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)

    or

    val est = new XGBoostClassifier(xgbParams)
          .setMaxDepth(5)
          .setEta(0.2)
          .setObjective("binary:logistic")
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)

Test this PR

  • Unit tests For scala 2.12
mvn clean package
mvn -P gpu clean package
  • Unit tests For scala 2.13
cd xgboost
python dev/change_scala_version.py --scala-version 2.13

cd jvm-packages
mvn clean package
mvn -P gpu clean package

@wbo4958 wbo4958 changed the title [jvm-packages] rework xgboost4j-spark and xgboost4j-spark-gpu [jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu Jul 26, 2024
@wbo4958 wbo4958 marked this pull request as ready for review July 26, 2024 01:36
@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 26, 2024

Hi @trivialfis @hcho3 @dotbg, Could you kindly help review this PR, Thx very much.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 29, 2024

Hi there, The CI got passed, could you @trivialfis @hcho3 @dotbg help review this PR? Thx very much.

@hcho3
Copy link
Collaborator

hcho3 commented Jul 31, 2024

Apologies for the delay, I will look at it this Friday.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to go through this rework but I'm afraid it's beyond my expertise. Could you please update the examples and documents so that we can get a feel of what's going on with the new interface? Also, we might need to schedule an online interview, @hcho3 might want to join as well.

@hcho3
Copy link
Collaborator

hcho3 commented Aug 2, 2024

Also, we might need to schedule an online interview, @hcho3 might want to join as well.

Yes, I would love to join.

@hcho3
Copy link
Collaborator

hcho3 commented Aug 2, 2024

Some questions about build artifacts and packaging:

  • Do we continue to build two variants of libxgboost4j.so, one with CUDA and one without?
  • Now that xgboost4j-gpu package is gone, where to we put libxgboost4j.so (CUDA variant)? Does it go inside the xgboost4j-spark-gpu package?
  • How long do we plan to distribute Spark 2.12 and 2.13 variants?

@hcho3
Copy link
Collaborator

hcho3 commented Aug 2, 2024

@trivialfis Do you have a particular agenda in mind for the online interview? Would it be primarily a Q&A session? I can set up an Outlook invite to choose the best time for all of us.

@trivialfis
Copy link
Member

trivialfis commented Aug 3, 2024

For breaking changes PR, I am mostly concerned about the user interface, hence the request for documentation:

  • How does the service loader work. What's the sequence of events for a CPU pipeline and a GPU pipeline. Also, the combination of rapids SQL and XGBoost parameters (4 cases with on/off for both sides).
  • What are the necessary changes from users, can we submit deprecation warnings, or can we submit informative errors, or just hard break.
  • The sequence of events when a sparse vector is used. The memory usage profile of various types of data.
  • What are the removed ETL, how does it change user's code.
  • What migration guide is needed.
  • What are the upcoming changes.
  • What are the difference between scala and pyspark.
  • how well does it integrate with SparkML training utilities like CV.
  • Any design plan for external memory? It's taking shape in the lastest XGBoost.
  • Any plan for resilient?

And we will need to have a clear picture for both CPU and GPU. As for internal code structure, i don't have any concern. It looks readable to me and I'm fairly sure that @wbo4958 does a better job than me for JVM based projects.

If there's no way to even submit an error for breaking change, should we publish a different package on maven?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 4, 2024

Sorry for delay, I was on vacation these days.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 4, 2024

Some questions about build artifacts and packaging:

  • Do we continue to build two variants of libxgboost4j.so, one with CUDA and one without?

[Bobby] Yes, we still need to build tow libxgboost4j.so

  • Now that xgboost4j-gpu package is gone, where to we put libxgboost4j.so (CUDA variant)? Does it go inside the xgboost4j-spark-gpu package?

[Bobby] the libxgboost4j.so with cuda will reside in xgboost4j package. xgboost4j-spark-gpu now depends on xgboost4j/xgboost4j-spark

  • How long do we plan to distribute Spark 2.12 and 2.13 variants?

[Bobby] what do you mean how long?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 4, 2024

@trivialfis @hcho3, yeah, let's have a online review session.

@hcho3
Copy link
Collaborator

hcho3 commented Aug 4, 2024

libxgboost4j.so with cuda will reside in xgboost4j package

CPU users will complain about the size of xgboost4j package if we choose this route.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 5, 2024

libxgboost4j.so with cuda will reside in xgboost4j package

CPU users will complain about the size of xgboost4j package if we choose this route.

Not exactly. When compiling the CPU versions, there's no cuda things in libxgboost4j.so which will be put into public/nightly places. But for GPU version, we first compile libxgboost4j.so with CUDA and put it into xgboost4j package, but we don't distribute this xgboost4j to the maven repo, instead, we will only distribute xgboost4j-spark-gpu package to the maven, which is a uber package including xgboost4j/xgboost4j-spark.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 8, 2024

Hey @trivialfis @hcho3, CI passed, Could you help review it? Thx

@wbo4958 wbo4958 requested a review from trivialfis August 8, 2024 09:19
Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help address the implication for the following items:

  • Remove any ETL operations in XGBoost

doc/jvm/xgboost_spark_migration.rst Outdated Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Outdated Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Outdated Show resolved Hide resolved
doc/jvm/index.rst Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Outdated Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Outdated Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Show resolved Hide resolved
doc/jvm/xgboost_spark_migration.rst Show resolved Hide resolved
Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help address the implication for the following items:

  • Remove any ETL operations in XGBoost

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 9, 2024

Could you please help address the implication for the following items:

  • Remove any ETL operations in XGBoost

For example,

ETL includes "caching dataset", and collecting group for ranking issue.

@trivialfis
Copy link
Member

@WeichenXu123 @rongou @Craigacp It would be great if we could get some help reviewing the PR when you are available.

Copy link
Contributor

@rongou rongou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar with this part of the code base, but overall looks good.

@Craigacp
Copy link
Contributor

I haven't written any Spark or Scala code since Spark 1.4 so I don't think I'll have much ability to review this PR.

@dotbg
Copy link
Contributor

dotbg commented Aug 10, 2024

I've started reviewing the spark-related code, but it takes a while

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 28, 2024

Hi @trivialfis, I've fixed all the comments.

@trivialfis
Copy link
Member

trivialfis commented Aug 28, 2024

@wbo4958 Where is the definition of xgboost ranker?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 28, 2024

@trivialfis, I will add it in the following PR, I didn't want to blow up this PR.

Copy link

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have more to go through but here are a batch of comments (some questions).

@wbo4958 wbo4958 marked this pull request as draft September 2, 2024 10:35
@jinmfeng
Copy link

jinmfeng commented Sep 4, 2024

For ranking task, I think the library should make sure the records with the same group column go to the same task, otherwise it will impact the model performance as we describe in 9491 . It looks like this PR didn't fix this issue yet.

@trivialfis
Copy link
Member

I will look into the ranking later. In theory, as long as the partitions for each local workers are sorted, the gradient is considered correct.

@wbo4958 wbo4958 marked this pull request as ready for review September 5, 2024 03:51
@wbo4958 wbo4958 marked this pull request as draft September 9, 2024 08:07
@wbo4958 wbo4958 marked this pull request as ready for review September 9, 2024 13:09
@trivialfis
Copy link
Member

Huge thanks to @dotbg @eordentlich for the reviews and for catching some issues early. I wouldn't be able to do that myself. Could you please help approve the PR for spark hanlding?

I don't have any more comments about logic related to XGBoost itself.

Copy link

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes looks great. 2 more minor comments.

Copy link

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@wbo4958
Copy link
Contributor Author

wbo4958 commented Sep 11, 2024

Hi @trivialfis, could you help merge it?

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice rework! Looking forward to the complete implementation.

@trivialfis trivialfis merged commit 67c8c96 into dmlc:master Sep 11, 2024
31 checks passed
@wbo4958 wbo4958 deleted the redesign-jvm branch September 17, 2024 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants