[CI] Various CI fixes #11196

Merged (9 commits) on Feb 4, 2025
hcho3
Collaborator

@hcho3 hcho3 commented Jan 31, 2025

No description provided.

* Fix dmlc#10752

* [CI] Replace Mambaforge -> Miniforge3

* Fix formatting
@hcho3 hcho3 changed the title from "[backport] Fix dmlc/xgboost#10752 (#10972)" to "[CI] Various CI fixes" Jan 31, 2025
@hcho3 hcho3 mentioned this pull request Jan 31, 2025
5 tasks
trivialfis and others added 3 commits January 31, 2025 21:54
* Fix tests with the latest scikit-learn.

* dask.

* Remove scikit-learn pin

---------

Co-authored-by: Philip Hyunsu Cho <[email protected]>
@jakirkham
Contributor

Seeing this error on CI:

/home/runner/work/xgboost/xgboost/dmlc-core/include/dmlc/omp.h:11:10: error: 'omp.h' file not found with <angled> include; use "quotes" instead
   11 | #include <omp.h>
      |          ^~~~~~~
      |          "omp.h"

It does appear to find OpenMP earlier in the log:

-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")

So is this just a matter of using "..." instead of <...>? Or is there more to the error?

@hcho3
Collaborator Author

hcho3 commented Feb 4, 2025

Let me backport #10987 and see if that fixes the error.

@jakirkham
Contributor

Thanks Hyunsu! 🙏

Also there were some interesting mypy errors on CI:

xgboost/dask/__init__.py:1651: error: Function does not return a value (it only ever returns None)  [func-returns-value]
xgboost/dask/__init__.py:1661: error: Argument "data" to "predict" has incompatible type "None"; expected "Union[DaskDMatrix, Union[Array, DataFrame]]"  [arg-type]
xgboost/dask/__init__.py:1692: error: Function does not return a value (it only ever returns None)  [func-returns-value]
xgboost/dask/__init__.py:1701: error: Argument "data" to "predict" has incompatible type "None"; expected "Union[DaskDMatrix, Union[Array, DataFrame]]"  [arg-type]
Found 4 errors in 1 file (checked 40 source files)
...
/home/runner/work/xgboost/xgboost/tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py:656: error: Function does not return a value (it only ever returns None)  [func-returns-value]
/home/runner/work/xgboost/xgboost/tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py:661: error: Argument 3 to "predict" has incompatible type "None"; expected "Union[DaskDMatrix, Union[Array, DataFrame]]"  [arg-type]
Found 2 errors in 1 file (checked 1 source file)
mypy 1.11.2 (compiled: yes)

AFAICT the relevant functions return Any. Maybe mypy doesn't like that they don't match the exact type expected? I wonder if we need a cast.

Also worth noting that the test it references uses asyncio:

m = await xgb.dask.DaskQuantileDMatrix(client, X, y)
output = await xgb.dask.train(
    client, {"tree_method": "hist", "device": "cuda"}, dtrain=m
)
with_m = await xgb.dask.predict(client, output, m)
with_X = await xgb.dask.predict(client, output, X)
inplace = await xgb.dask.inplace_predict(client, output, X)

So maybe mypy is just confused by the asyncio usage, in which case the check could be skipped in this instance.
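For reference, the cast idea mentioned above could look roughly like the following. This is a minimal sketch, not the `xgboost.dask` code: `fake_predict` is a hypothetical stand-in for an awaited call whose result is typed `Any`, and the thread does not confirm whether a cast would actually silence the errors reported by mypy 1.11.2.

```python
import asyncio
from typing import Any, List, cast


async def fake_predict() -> Any:
    """Hypothetical stand-in for an awaited call whose result is typed Any."""
    await asyncio.sleep(0)  # yield to the event loop, like a real async call
    return [0.1, 0.9]


async def main() -> List[float]:
    # cast() does nothing at runtime; it only tells the type checker
    # which type to assume for the awaited value.
    predictions = cast(List[float], await fake_predict())
    return predictions


result = asyncio.run(main())
```

Because `cast` is a no-op at runtime, it changes only what the type checker sees, so it cannot mask a real type mismatch at execution time.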

@hcho3
Collaborator Author

hcho3 commented Feb 4, 2025

We can ignore errors from MyPy, since they are likely due to changes in the latest MyPy.

@jakirkham
Contributor

Thanks Hyunsu! 🙏

Guessing we can ignore the C++ lints as well?

Also seeing this comm cleanup error in another CI job:

  [22:30:40] WARNING: /home/runner/work/xgboost/xgboost/src/collective/comm.cc:358: The communicator is being destroyed without a call to shutdown first. This can lead to undefined behaviour.
  [22:30:40] WARNING: /home/runner/work/xgboost/xgboost/src/collective/socket.cc:143: socket.cc(186): Failed to connect to:192.168.122.156:14345 Error:
  - [socket.h:357|22:30:40]: Socket error. system error:Connection refused
  [22:30:40] WARNING: /home/runner/work/xgboost/xgboost/src/collective/socket.cc:150: Retrying connection to 192.168.122.156 for the 1 time.
  [22:30:41] WARNING: /home/runner/work/xgboost/xgboost/src/collective/socket.cc:143: socket.cc(162): Failed to connect to:192.168.122.156:14345 Error:
  - [socket.cc:161|22:30:41]: connect failed. system error:Connection refused
  [22:30:41] WARNING: /home/runner/work/xgboost/xgboost/src/collective/comm.cc:362: 
  - [comm.cc:40|22:30:41]: Failed to connect to the tracker.
  - [socket.cc:196|22:30:41]: Failed to connect to 192.168.122.156:14345
  - [socket.cc:161|22:30:41]: connect failed. Connection refused
  Terminating due to uncaught exception 0x11b7421a000 of type dmlc::Error
  Abort trap (core dumped)

@hcho3
Collaborator Author

hcho3 commented Feb 4, 2025

The errors from the FreeBSD job were fixed in #10756. Let me backport the fix.

@hcho3
Collaborator Author

hcho3 commented Feb 4, 2025

The timeout error from a Dask test in https://buildkite.com/xgboost/xgboost-ci-multi-gpu/builds/7734#0194bfca-26b2-464c-9450-ae01f6369bc4 appears to be specific to the Dask version used in an old version of RAPIDS (RAPIDS 24.06, Dask 2024.5.1). I will address the failure in a follow-up PR, which will upgrade the CUDA version (to 12.8) as well as the RAPIDS version (to 24.12).

@hcho3 hcho3 merged commit fc32798 into dmlc:release_2.1.0 Feb 4, 2025
25 of 29 checks passed
@hcho3 hcho3 deleted the backport_10972 branch February 4, 2025 05:15
@jakirkham
Contributor

Thanks Hyunsu! 🙏
