Can't resume training my dataset #997

Closed
1 task done
bluesky93128 opened this issue Jan 24, 2025 · 13 comments
Labels
bug (Something isn't working), HUB (Ultralytics HUB issues), info needed (More information is required to proceed)

Comments

@bluesky93128

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training, Datasets

Bug

None of the GPU options are working right now.
I've tried every option in the list, but I still can't resume my training.

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

@bluesky93128 bluesky93128 added the bug Something isn't working label Jan 24, 2025
@UltralyticsAssistant UltralyticsAssistant added HUB Ultralytics HUB issues info needed More information is required to proceed labels Jan 24, 2025
@UltralyticsAssistant
Member

👋 Hello @bluesky93128, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different model integration options, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, enabling mobile model execution.
    • iOS. Learn about YOLO CoreML models optimized for Apple's Neural Engine.
    • Android. Explore TFLite acceleration on Android devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

As this appears to be a 🐛 Bug Report, could you please provide additional details, such as the specific steps to reproduce the issue? This includes:

  1. A Minimum Reproducible Example (MRE) showing the code or configuration used.
  2. Screenshots or logs that capture the problem.

This information will help us diagnose the issue faster 🔍! An Ultralytics engineer will review and assist with this shortly. Thank you for your patience and collaboration! 😊

@bluesky93128
Author

#992

Please refer to this ticket. Some GPUs were working before, but now all of them give an error.

@pderrenger
Member

@bluesky93128 thank you for bringing this to our attention and referencing the related issue. If you are experiencing GPU-related errors across all options while trying to resume training, here are a few steps to help troubleshoot and resolve the issue:

  1. Verify GPU Availability:

    • Confirm that your GPUs are detected and available by running the nvidia-smi command on your local machine or checking the status of the GPUs in your environment if you're using cloud instances.
  2. Check for Updates:

    • Ensure you are using the latest version of the Ultralytics HUB platform and SDK. Updates often come with bug fixes and enhanced compatibility. You can also verify that your PyTorch installation is compatible with your CUDA version.
  3. Resume Training Configuration:

    • When resuming training, ensure that the resume parameter is set correctly. For example:

          from ultralytics import YOLO

          # Load the checkpoint model
          model = YOLO("path/to/last.pt")

          # Resume training
          results = model.train(resume=True)

      If using the CLI:

          yolo train resume model=path/to/last.pt

      Reference: Resuming Interrupted Training.
  4. Instance Selection (Cloud Training):

    • If you are using Ultralytics HUB Cloud Training, ensure you've selected a compatible GPU instance (e.g., Nvidia T4). Issues like these can sometimes occur due to resource allocation problems. Try restarting your training session or selecting a different instance.
  5. Logs for Debugging:

    • If the problem persists, please share any error logs or messages you receive during the process. This will help us diagnose the issue more effectively.
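
For a local or self-managed machine, steps 1 and 2 can be sketched roughly as below. This is a minimal diagnostic, not an official Ultralytics tool: `gpu_report` is a hypothetical helper name, and the script assumes `nvidia-smi` and PyTorch may or may not be installed, reporting `None` for whatever it cannot find. It does not apply to HUB Cloud Training, where the GPU is provisioned by the platform.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic GPU diagnostics without assuming any tool is installed."""
    report = {
        "nvidia_smi": None,        # raw output of nvidia-smi, if it could be run
        "torch_version": None,     # installed PyTorch version
        "torch_cuda_build": None,  # CUDA version PyTorch was built against
        "cuda_available": None,    # whether PyTorch can actually see a GPU
    }

    # nvidia-smi ships with the NVIDIA driver; skip it if it is not on PATH.
    smi = shutil.which("nvidia-smi")
    if smi:
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            output = result.stdout if result.returncode == 0 else result.stderr
            report["nvidia_smi"] = output.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi"] = f"failed to run nvidia-smi: {exc}"

    # PyTorch is optional here; leave the torch fields as None if it is missing.
    try:
        import torch
    except ImportError:
        return report

    report["torch_version"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` while `nvidia_smi` lists a GPU, a mismatch between the PyTorch build and the installed driver or CUDA version is a likely cause, and that output is worth including with any logs you share.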

If the above steps do not resolve the issue, we recommend testing on different hardware or environments to rule out compatibility issues. Additionally, feel free to share more details about your setup (e.g., dataset, model, and environment specifics). We’ll do our best to assist further. 😊

Let us know how it goes!

@bluesky93128
Author

@pderrenger
Thanks for the comment, but I'm not running it locally. I'm using Ultralytics' GPU service, and none of the options work.

Image

I've tried all of these, but nothing works.

@yogendrasinghx
Member

Hi @bluesky93128,

Thank you for the update and clarification. I’ve reported this issue to the development team for further investigation. They are looking into it, and I’ll keep you updated as soon as we have a resolution.

We appreciate your patience!

@bluesky93128
Author

Hi @yogendrasinghx
Still not working, any update?

@yogendrasinghx
Member

Hi @bluesky93128,

We’ve investigated the issue, and the development team is actively working on a fix related to low GPU availability. Most likely, the GPU you selected wasn’t available at the time of your request, but it may become available a few minutes later.

We also noticed that the previous error message wasn’t clear, so we’ve released a new version that provides a more informative message when the selected GPU is unavailable. Please try again and let us know if you continue experiencing issues.

Thanks for your patience!

@sergiuwaxmann
Member

Related issue: #998

@bluesky93128
Author

I'm still not able to resume my training.

@bluesky93128
Author

Image

I can see this screen, but after a while, it stops again.

@yogendrasinghx
Member

@bluesky93128

Thank you for reaching out. To help us investigate this issue further, could you please share the Model ID? You can find it in the URL when you access your model on the platform. Providing this information will allow us to locate your account and identify the issue.

Looking forward to your response so we can resolve this for you.
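
If it helps, here is a small sketch for pulling the Model ID out of a copied URL. It assumes the model page URL has the form `https://hub.ultralytics.com/models/<MODEL_ID>`; that layout and the helper name `model_id_from_url` are assumptions for illustration, so check the result against what you see in your browser.

```python
from urllib.parse import urlparse


def model_id_from_url(url):
    """Return the path segment that follows 'models', or None if there is none."""
    # Split the URL path into non-empty segments, ignoring query and fragment.
    segments = [part for part in urlparse(url).path.split("/") if part]
    if "models" in segments:
        index = segments.index("models")
        if index + 1 < len(segments):
            return segments[index + 1]
    return None


if __name__ == "__main__":
    # EXAMPLE_MODEL_ID is a placeholder, not a real model.
    print(model_id_from_url("https://hub.ultralytics.com/models/EXAMPLE_MODEL_ID"))
```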

@bluesky93128
Author

xt4bDTXOQFHs3t67z16w

@yogendrasinghx
Member

@bluesky93128 We’ve reverted the resume checkpoint to epoch 53 (the last successfully uploaded checkpoint), which should allow you to resume training immediately.

Please try resuming the training again.

5 participants