Can't resume training my dataset #997

bluesky93128 · 2025-01-24T21:53:27Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

Training, Datasets

Bug

None of the GPU options are working right now.
I've tried every option in the list, but I still can't resume my training.

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

UltralyticsAssistant · 2025-01-24T21:53:50Z

👋 Hello @bluesky93128, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

Quickstart. Start training and deploying YOLO models with HUB in seconds.
Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets in YOLO format.
Projects: Creating and Managing. Group your models into projects for improved organization.
Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
Integrations. Explore different model integration options, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, enabling mobile model execution.
- iOS. Learn about YOLO CoreML models optimized for Apple's Neural Engine.
- Android. Explore TFLite acceleration on Android devices.
Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

As this appears to be a 🐛 Bug Report, could you please provide additional details, such as the specific steps to reproduce the issue? This includes:

A Minimum Reproducible Example (MRE) showing the code or configuration used.
Screenshots or logs that capture the problem.

This information will help us diagnose the issue faster 🔍! An Ultralytics engineer will review and assist with this shortly. Thank you for your patience and collaboration! 😊

bluesky93128 · 2025-01-25T00:15:02Z

#992

Please refer to this ticket. Some GPUs were working, but now all of them give an error.

pderrenger · 2025-01-25T14:59:14Z

@bluesky93128 thank you for bringing this to our attention and referencing the related issue. If you are experiencing GPU-related errors across all options while trying to resume training, here are a few steps to help troubleshoot and resolve the issue:

Verify GPU Availability:
- Confirm that your GPUs are detected and available by running the nvidia-smi command on your local machine or checking the status of the GPUs in your environment if you're using cloud instances.

Check for Updates:
- Ensure you are using the latest version of the Ultralytics HUB platform and SDK. Updates often come with bug fixes and enhanced compatibility. You can also verify that your PyTorch installation is compatible with your CUDA version.

Resume Training Configuration:

When resuming training, ensure that the resume parameter is set correctly. For example:

from ultralytics import YOLO

# Load the checkpoint model
model = YOLO("path/to/last.pt")

# Resume training
results = model.train(resume=True)

If using the CLI:

yolo train resume model=path/to/last.pt

Reference: Resuming Interrupted Training.
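
As a minimal sketch of the Python route above, the wrapper below checks that the checkpoint file exists before asking Ultralytics to resume, so a wrong path fails early with a clear message. The helper name `resume_from_checkpoint` is hypothetical and not part of the Ultralytics API. It assumes local or self-managed training and does not apply to HUB Cloud Training sessions started from the web interface.

```python
from pathlib import Path


def resume_from_checkpoint(checkpoint):
    """Resume training from a saved checkpoint (hypothetical helper, not an Ultralytics API)."""
    ckpt = Path(checkpoint)

    # Fail early if the path is wrong, before any model loading happens.
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")

    # Imported lazily so the path check works even without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(ckpt))          # load the checkpoint, as in the example above
    return model.train(resume=True)  # continue from the saved training state


# Example call (path is a placeholder, as in the snippet above):
# results = resume_from_checkpoint("path/to/last.pt")
```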

Instance Selection (Cloud Training):
- If you are using Ultralytics HUB Cloud Training, ensure you've selected a compatible GPU instance (e.g., Nvidia T4). Issues like these can sometimes occur due to resource allocation problems. Try restarting your training session or selecting a different instance.

Logs for Debugging:
- If the problem persists, please share any error logs or messages you receive during the process. This will help us diagnose the issue more effectively.
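
The GPU and version checks in the steps above can be sketched as a small diagnostic script whose output could be pasted into a bug report. This is only an illustration for local or self-managed machines: the function name `gpu_report` and the report keys are assumptions made for this example, and the script cannot inspect GPUs allocated by HUB Cloud Training.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic GPU diagnostics; fields stay None/False when unavailable."""
    report = {
        "nvidia_smi": None,       # GPUs listed by the driver, one string per GPU
        "torch": None,            # installed PyTorch version
        "torch_cuda": None,       # CUDA version PyTorch was built against
        "cuda_available": False,  # whether PyTorch can actually use a GPU
    }

    # nvidia-smi is only present when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi"):
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if proc.returncode == 0:
            report["nvidia_smi"] = proc.stdout.strip().splitlines()

    # PyTorch is optional here; skip its checks if it is not installed.
    try:
        import torch
    except ImportError:
        return report

    report["torch"] = torch.__version__
    report["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi` lists a GPU but `cuda_available` is False, a mismatch between the PyTorch build and the installed driver or CUDA version is a likely cause worth checking first.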

If the above steps do not resolve the issue, we recommend testing on different hardware or environments to rule out compatibility issues. Additionally, feel free to share more details about your setup (e.g., dataset, model, and environment specifics). We’ll do our best to assist further. 😊

Let us know how it goes!

bluesky93128 · 2025-01-26T23:05:47Z

@pderrenger
Thanks for the comment. I'm not running it locally, though; I'm using Ultralytics' GPU service, and none of the options work.

I've tried all of these, but nothing works.

yogendrasinghx · 2025-01-27T08:46:37Z

Hi @bluesky93128,

Thank you for the update and clarification. I’ve reported this issue to the development team for further investigation. They are looking into it, and I’ll keep you updated as soon as we have a resolution.

We appreciate your patience!

bluesky93128 · 2025-01-29T09:17:59Z

Hi @yogendrasinghx
Still not working, any update?

yogendrasinghx · 2025-01-29T09:23:20Z

Hi @bluesky93128,

We’ve investigated the issue, and the development team is actively working on a fix related to low GPU availability. Most likely, the GPU you selected wasn’t available at the time of your request, but it may become available a few minutes later.

We also noticed that the previous error message wasn’t clear, so we’ve released a new version that provides a more informative message when the selected GPU is unavailable. Please try again and let us know if you continue experiencing issues.

Thanks for your patience!

sergiuwaxmann · 2025-01-29T09:26:06Z

Related issue: #998

bluesky93128 · 2025-01-29T14:20:13Z

I'm still not able to resume my training

bluesky93128 · 2025-01-29T14:29:04Z

I can see this screen, but after a while, it stops again.

yogendrasinghx · 2025-01-29T14:37:09Z

@bluesky93128

Thank you for reaching out. To help us investigate this issue further, could you please share the Model ID? You can find it in the URL when you access your model on the platform. Providing this information will allow us to locate your account and identify the issue.

Looking forward to your response so we can resolve this for you.

bluesky93128 · 2025-01-29T22:45:46Z

xt4bDTXOQFHs3t67z16w

yogendrasinghx · 2025-01-30T11:34:09Z

@bluesky93128 We’ve reverted the resume checkpoint to epoch 53 (the last successfully uploaded checkpoint), which should allow you to resume training immediately.

Please try resuming the training again.

bluesky93128 added the bug (Something isn't working) label on Jan 24, 2025

UltralyticsAssistant added the HUB (Ultralytics HUB issues) and info needed (More information is required to proceed) labels on Jan 24, 2025

sergiuwaxmann closed this as completed Jan 29, 2025
