Dropping CUDA 11.2 #5339

Closed
jakirkham opened this issue Jan 4, 2024 · 15 comments

Comments

@jakirkham
Member

Am raising this issue to discuss dropping CUDA 11.2

At this stage the bulk of conda-forge CUDA packages are built on CUDA 11.8, which supports the same hardware. Also, cloud service providers (CSPs) have largely moved to CUDA 11.8+

The CUDA 11.2 Docker images that we use are planned for deletion in May 2024, whereas the CUDA 11.8 Docker images remain available

Beginning with CUDA 12.0, standard conda-forge Docker images are used and the full CUDA Toolkit is made available with Conda packages

Given this, would propose dropping CUDA 11.2 and moving to CUDA 11.8 as the new minimum in conda-forge

Please let me know what you think

@jakirkham
Member Author

cc @conda-forge/core

@hmaarrfk
Contributor

hmaarrfk commented Jan 5, 2024

I still have trouble personally running TensorFlow with 11.8. As such, I am not too inclined to drop it outright, but maybe let individual feedstocks drop it?

@jakirkham
Member Author

Saw some things over the holidays about that, but haven't quite caught up. Is there a particular place I should start reading?

In any event this isn't an immediate thing, but beginning May 2024 maintaining CUDA 11.2 will become untenable (as the NVIDIA Docker images will be deleted)

So let's start figuring out what we need to do to sunset 11.2. Right now I have...

  • TensorFlow CUDA 11.8 issues (link?)

Anything else?

@hmaarrfk
Contributor

hmaarrfk commented Jan 5, 2024

i see, i didn't know 11.2 was so old lol.

@h-vetinari
Member

IMO we have to finish the CUDA 11.8 migration first, by which I don't mean #5340, but fixing & merging the open 11.8 PRs (or at least greatly reduce them).

> In any event this isn't an immediate thing, but beginning May 2024 maintaining CUDA 11.2 will become untenable (as the NVIDIA Docker images will be deleted)

Our existing images won't be deleted. It'll just mean we cannot rebuild them. Granted, that's a bad place to be, but it's not like everything would break from one day to the next in May.

@jakirkham
Member Author

> IMO we have to finish the CUDA 11.8 migration first, by which I don't mean #5340, but fixing & merging the open 11.8 PRs (or at least greatly reduce them).

Had asked about what was still needed before closing out the migration last meeting. Didn't hear anything at that time. So we agreed to close it out

Both before and after that meeting spent substantial time with maintainers working to close out CUDA 11.8 PRs. The remaining feedstocks (with very few exceptions) are backlogged with multiple migration PRs and version updates. Given this, think we have likely already reached that state

That said, am happy to continue following up with maintainers who have questions on how to upgrade

> Our existing images won't be deleted. It'll just mean we cannot rebuild them. Granted, that's a bad place to be, but it's not like everything would break from one day to the next in May.

As noted before when this was brought up, if the community takes this route, the community will need to pick up maintenance around CUDA 11.2

Would just add this as food for thought: it seems like we are stretched thin of late. There are significant asks of our time around using newer GLIBC, macOS minimum, NumPy 2, proposed recipe format changes, etc. Not to mention the day-to-day work of keeping things moving smoothly. Continuing to support the long tail doesn't seem like a realistic goal

@jaimergp
Member

jaimergp commented Feb 8, 2024

Do we have a list of feedstocks stuck in 11.2? I think it's fair to move on to 11.8 by May if we communicate the timelines with them. Like "Hey 11.2 won't be supported after May, we want to make sure you are equipped to handle the transition or, otherwise, let us know how we can help".

@jakirkham
Member Author

For clarity, the way we built for CUDA 11.2 was to restrict to 11.2 at build time and then allow any CUDA 11.2+ at runtime. So CUDA 11.8 installs are permissible with existing packages built for CUDA 11.2. IOW, from the perspective of an end user installing packages, they would still be able to install old packages on CUDA 11.8
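
To illustrate the build-time vs. runtime distinction above, here is a minimal Python sketch of the "built against 11.2, runs on 11.2+" check. The function names are made up for illustration, and the same-major-version upper bound is an assumption that the thread does not state; this is not how conda's solver is actually implemented.

```python
def parse_version(text):
    """Turn a version string such as "11.8" into a tuple of ints, e.g. (11, 8)."""
    return tuple(int(part) for part in text.split("."))


def runtime_satisfies(built_against, runtime):
    """Return True if a package built against `built_against` may run on `runtime`.

    The lower bound (runtime >= build version) is what the comment above describes.
    The same-major-version requirement is an assumption added for this sketch.
    """
    built = parse_version(built_against)
    found = parse_version(runtime)
    same_major = built[0] == found[0]
    return same_major and found >= built


# A package built against 11.2, checked against several runtime versions.
for candidate in ("11.1", "11.2", "11.8", "12.0"):
    print(candidate, runtime_satisfies("11.2", candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering surprises when a minor version has two digits.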

That said, we did run a CUDA 11.8 migrator ( #4834 ) and that ran for ~4 months until we closed it ( #5340 ). During that time I did a lot of hand holding of maintainers to update feedstocks

By the time we closed the migration, 83% of feedstocks were migrated. Another 13% had an open CUDA 11.8 PR. Here's a quick search result list of those. This accounted for 96% of feedstocks. Another 4% had no PR (IIRC this was maybe 4-5). This info used to be under the closed CUDA 11.8 migrator, but I don't see it any more (maybe @beckermr can advise on how to regenerate that info)
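
As a quick sanity check on the figures above, the shares can be added up in a few lines of Python. The implied total is only a rough back-of-the-envelope estimate derived from the rounded percentages and the "maybe 4-5" recollection; it is not a number reported in this thread.

```python
# Shares quoted in the comment above (rounded percentages).
migrated_pct = 83
open_pr_pct = 13
no_pr_pct = 4

# Migrated plus open-PR feedstocks: the "96%" figure.
covered_pct = migrated_pct + open_pr_pct
print("covered:", covered_pct)


def implied_total(count_without_pr, share_pct=no_pr_pct):
    """Estimate the total feedstock count if `count_without_pr` is `share_pct` percent of it."""
    return round(count_without_pr * 100 / share_pct)


# "maybe 4-5" feedstocks without a PR implies roughly this many feedstocks overall.
for count in (4, 5):
    print(count, "without PR ->", implied_total(count), "total (rough estimate)")
```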

If we think it is reasonable and are looking for an announcement, am happy to draft one so it will show up on our new announcement page 😉

@jakirkham
Member Author

Brought this up in the conda-forge meeting today, sounds like there are no objections to dropping

Next step will be adding an announcement that this will be dropped

Recommendation is to pick the first day of the work week before the Docker image is deleted so that folks have time to do any remediation
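
One way to read that recommendation is "the last Monday strictly before the deletion day". A small sketch of that date calculation follows, assuming the work week starts on Monday. The deletion date used here is purely hypothetical, since the thread only says "May 2024".

```python
from datetime import date, timedelta


def monday_before(deletion_day):
    """Return the last Monday strictly before `deletion_day`.

    date.weekday() is 0 for Monday, so a deletion day that is itself a
    Monday maps to the Monday one week earlier.
    """
    days_back = deletion_day.weekday() or 7
    return deletion_day - timedelta(days=days_back)


# Hypothetical deletion date, for illustration only.
example_deletion_day = date(2024, 5, 15)
print(monday_before(example_deletion_day))
```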

@jakirkham
Member Author

jakirkham commented Mar 7, 2024

> Do we have a list of feedstocks stuck in 11.2? I think it's fair to move on to 11.8 by May if we communicate the timelines with them. Like "Hey 11.2 won't be supported after May, we want to make sure you are equipped to handle the transition or, otherwise, let us know how we can help".

After doing a bit of digging, was able to identify when the CUDA 11.8 migration data was removed from the bot (so the last migration status):

$ git log --all -1 -- status/migration_json/cuda118.json
commit eb8d2a43d20f9de828ace2c17036544e49b4546d
Author: regro-cf-autotick-bot <[email protected]>
Date:   Thu Jan 25 21:58:24 2024 +0000

    Update Graph https://github.com/regro/cf-scripts/actions/runs/7660837918

It appears the bot removed the CUDA 11.8 migration data in commit ( regro/cf-graph-countyfair@eb8d2a4 ). Looking at the CUDA 11.8 migration status before that can find a handful of feedstocks without PRs. Based on that opened the following issues:

That said, according to GitHub the last commit to main for all of these was 2 years ago. So think it is unlikely they will be updated any time soon
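
For anyone wanting to repeat the "migration status before that" lookup, here is a rough Python sketch. It reads the file as it existed in the parent of the removal commit using `git show <commit>^:<path>`. It assumes a local clone of regro/cf-graph-countyfair with enough history to contain that commit. The helper names are hypothetical, and the sketch makes no assumption about which keys the JSON contains.

```python
import json
import subprocess

# Commit and path taken from the `git log` output above.
REMOVAL_COMMIT = "eb8d2a43d20f9de828ace2c17036544e49b4546d"
STATUS_PATH = "status/migration_json/cuda118.json"


def show_before_removal_cmd(commit=REMOVAL_COMMIT, path=STATUS_PATH):
    """Build a `git show` command for `path` as of the parent of `commit`."""
    return ["git", "show", f"{commit}^:{path}"]


def load_last_status(repo_dir):
    """Run the command inside a local clone and parse the output as JSON."""
    result = subprocess.run(
        show_before_removal_cmd(),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `load_last_status("path/to/clone")` should return the last recorded migration status, which can then be inspected for feedstocks without PRs.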

@jakirkham
Member Author

> Next step will be adding an announcement that this will be dropped

Have submitted an announcement in PR: conda-forge/conda-forge.github.io#2098

@jakirkham
Member Author

jakirkham commented Mar 7, 2024

Also found these CUDA 11.8 migration PRs, which are still open

In many cases they have other open migrations and version updates

Re-rendered and merged their contents into the CUDA 12 migrators that they all have open. So these can be resolved with the CUDA 12 migrations

As CUDA 11.8 is already in the CUDA build matrix, simply re-rendering any PR would add CUDA 11.8 (unless explicitly skipped in the recipe)

Edit: Added 2 more that were missed before

Edit 2: Have gone through and commented on all of these PRs noting the plan to drop CUDA 11.2 with a reference to the announcement and this issue

@jakirkham
Member Author

> Next step will be adding an announcement that this will be dropped
>
> Have submitted an announcement in PR: conda-forge/conda-forge.github.io#2098

The announcement is now live: https://conda-forge.org/news/2024/03/06/dropping-cuda-112/

@jakirkham
Member Author

@h-vetinari just dropped CUDA 11.2 and bumped the minimum to CUDA 11.8 in PR: #5799

Dropping CUDA 11.2 from staged-recipes with PR: conda-forge/staged-recipes#26204

Also as the CUDA minimum in conda-forge is now 11.8, dropping all CUDA images pre-11.8 in PR: conda-forge/docker-images#262

@jakirkham
Member Author

This is now complete. Closing this out

Thanks all! 🙏

weiji14 added a commit to weiji14/deepspeed-feedstock that referenced this issue Nov 12, 2024
weiji14 added a commit to conda-forge/deepspeed-feedstock that referenced this issue Nov 12, 2024
* Remove if-branch for CUDA 11.2

No more CUDA 11.2 builds on conda-forge since conda-forge/conda-forge-pinning-feedstock#5339

* Remove ninja as runtime dependency

Xref #1

* Bump build number to 1

* Try building with EvoFormerAttention

* Try building with CUTLASS ops

* Try building with ragged device ops

* MNT: Re-rendered with conda-build 24.9.0, conda-smithy 3.44.3, and conda-forge-pinning 2024.11.11.08.59.26