Unable to resume my training #1009

bluesky93128 · 2025-02-02T21:17:48Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

No response

Bug

I'm unable to resume my training at 99

Environment

`requirements: Ultralytics requirement ['hub-sdk>=0.0.12'] not found, attempting AutoUpdate...
Collecting hub-sdk>=0.0.12
Downloading hub_sdk-0.0.18-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from hub-sdk>=0.0.12) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (2024.12.14)
Downloading hub_sdk-0.0.18-py3-none-any.whl (42 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.9/42.9 kB 79.4 MB/s eta 0:00:00
Installing collected packages: hub-sdk
Successfully installed hub-sdk-0.0.18

requirements: AutoUpdate success ✅ 4.5s, installed 1 package: ['hub-sdk>=0.0.12']
requirements: ⚠️ Restart runtime or rerun command for updates to take effect

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/azkttWDuYNBz86w0nkxk 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/5PTzkSPQ8xUbTvODkgyS8oYjAA12/models/azkttWDuYNBz86w0nkxk/epoch-55.pt to 'weights/epoch-55.pt'...
2025-02-02 21:14:34,719 - hub_sdk.helpers.logger - ERROR - Internal server error.
ERROR:hub_sdk.helpers.logger:Internal server error.
2025-02-02 21:14:34,725 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
ERROR:hub_sdk.helpers.logger:Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-10 (_start_heartbeats):
Traceback (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/usr/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/hub_sdk/base/server_clients.py", line 158, in _start_heartbeats
raise e
File "/usr/local/lib/python3.11/dist-packages/hub_sdk/base/server_clients.py", line 146, in _start_heartbeats
res = self.post(endpoint, json=payload).json()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'json'
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/5PTzkSPQ8xUbTvODkgyS8oYjAA12/models/azkttWDuYNBz86w0nkxk/epoch-55.pt...

UnpicklingError Traceback (most recent call last)
in <cell line: 0>()
1 hub.login('61cbc03a08f32c5cc4fbdd65bd0990b902b34c4a4a')
2
----> 3 model = YOLO('https://hub.ultralytics.com/models/azkttWDuYNBz86w0nkxk')
4 results = model.train()

7 frames
/usr/local/lib/python3.11/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
1626 )
1627
-> 1628 magic_number = pickle_module.load(f, **pickle_load_args)
1629 if magic_number != MAGIC_NUMBER:
1630 raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.`

Minimal Reproducible Example

I ran this on Google Colab and got the error above.

Additional

No response

UltralyticsAssistant · 2025-02-02T21:18:13Z

👋 Hello @bluesky93128, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

- Quickstart. Start training and deploying YOLO models with HUB in seconds.
- Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
- Projects: Creating and Managing. Group your models into projects for improved organization.
- Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
- Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
- Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
  - iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
  - Android. Explore TFLite acceleration on mobile devices.
- Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, thank you for providing some details. However, we will need a bit more information to investigate this issue thoroughly. Could you provide a Minimal Reproducible Example (MRE) that includes steps, code snippets, and any relevant details about your execution environment (e.g., platform, Python version, etc.)? You can find guidance on creating an MRE here to help us understand the problem better.

If this is a ❓ Question, please provide as much context as possible, including configurations, parameters, and intended outcomes, so we can assist more effectively.

We’ve flagged this issue for review, and an Ultralytics engineer will assist you as soon as possible. Thank you for your patience and for helping us improve the Ultralytics HUB! 😊

bluesky93128 · 2025-02-02T22:23:20Z

One more thing: the precision also dropped.

pderrenger · 2025-02-03T00:09:06Z

@bluesky93128 thank you for reporting this issue and including detailed error logs. Let's address the training resumption problem first:

1. Checkpoint Corruption: The UnpicklingError suggests the checkpoint file might be corrupted. This can happen if the previous training session didn't save properly. You can:
   - Try resuming from an earlier checkpoint if available
   - Manually download the weights from your HUB model page and verify integrity
2. Authentication Issues: The 'NoneType' object has no attribute 'json' error might indicate an authentication problem. Try refreshing your login:
   ```
   from ultralytics import hub
   hub.login('YOUR_API_KEY')  # Refresh login
   ```
3. Package Versions: Let's verify you're using the latest versions:
   ```
   !pip install -U ultralytics hub-sdk
   ```
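
As a rough way to "verify integrity" before retrying, the first bytes of the downloaded file can be inspected without unpickling it. The message `invalid load key, '<'` means pickle found a `<` at the start of the file, which is what an HTML or XML error body looks like, whereas a checkpoint written by `torch.save` is normally a zip archive (the default since PyTorch 1.6) or, in the legacy format, a raw pickle. The sketch below uses only the standard library; `sniff_checkpoint` and its return labels are hypothetical names chosen for illustration, not part of Ultralytics or hub-sdk.

```python
import zipfile
from pathlib import Path


def sniff_checkpoint(path):
    """Classify a file by its leading bytes without unpickling it.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    path = Path(path)
    with path.open("rb") as f:
        head = f.read(64)
    if not head:
        return "empty"
    if head.lstrip()[:1] == b"<":
        # '<' is the byte pickle reported as the invalid load key;
        # HTML/XML error responses start this way.
        return "markup"
    if zipfile.is_zipfile(path):
        # torch.save writes a zip archive by default (PyTorch >= 1.6).
        return "zip"
    if head[:1] == b"\x80":
        # Pickle protocol 2+ marker, as seen in legacy torch.save files.
        return "pickle"
    return "unknown"
```

If `sniff_checkpoint('weights/epoch-55.pt')` reports `"markup"`, that would be consistent with the download having returned an error page instead of weights, so re-downloading or choosing another checkpoint is more likely to help than reinstalling packages.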

For the precision drop, this could be related to the resumption issue. Once we resolve the checkpoint loading problem, I recommend:

- Monitoring metrics through the HUB dashboard
- Verifying your training configuration matches the original parameters
- Checking for class imbalance or dataset issues

If the issue persists after trying these steps, please share:

- The exact command/notebook cell you're using to resume training
- Whether you can reproduce this with a new training session
- Any additional error messages from the Colab runtime logs

You can find more troubleshooting guidance in our HUB Training Documentation.

Let's get you back on track with your training! 🚀

bluesky93128 · 2025-02-03T00:59:53Z

Actually, I was training on this dataset using Ultralytics' GPU service, but it always gives an error when I try to train.
I've trained 2 models with the same dataset, but their accuracy is too low (training was suspended too many times), so I tried again with a larger image size, but it gives errors again.
What's wrong with your service?
Why is it so buggy?

sergiuwaxmann · 2025-02-03T09:10:48Z

@bluesky93128 Looks like the model was pointing to a corrupted checkpoint (corrupted during the upload). You should be able to resume training now.
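
A generic way to detect corruption introduced during an upload is to compare a checksum taken before the upload with one taken on the copy that comes back. The sketch below is only an illustration of that idea using the standard library; it says nothing about how HUB stores or validates checkpoints, and `sha256_of` and `same_content` are hypothetical names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large checkpoints need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(local_path, downloaded_path):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(local_path) == sha256_of(downloaded_path)
```

A mismatch between the file saved at the end of an epoch and the copy downloaded later would indicate that the bytes changed somewhere in between.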

bluesky93128 · 2025-02-03T11:33:39Z

@sergiuwaxmann
I've tried again, but it's still not working.
Could you please look into why the service is saving a corrupted checkpoint? I believe this issue, along with the service suspension problem, should be addressed. Thank you!

bluesky93128 · 2025-02-03T13:04:36Z

@sergiuwaxmann @pderrenger

NOTHING IS WORKING!!!

I SPENT ALMOST $50 ON THIS TRAINING

IT'S RIDICULOUS!

I OPENED 3 BUG TICKETS, BUT NOTHING HAS BEEN FIXED AND IT'S STILL BUGGY.

PLEASE FIX IT AND MAKE IT WORK!!!

sergiuwaxmann · 2025-02-03T13:14:46Z

@bluesky93128 I apologize for the inconvenience.

It looks like the training runs out of memory, causing a crash.

I see that you have an ongoing training session now — hopefully, it works this time. Please let me know if it crashes again.

bluesky93128 · 2025-02-03T22:43:15Z

@sergiuwaxmann

It's still not working.

Please make it work. Here's my model ID:

azkttWDuYNBz86w0nkxk

sergiuwaxmann · 2025-02-04T09:36:24Z

@bluesky93128 I see that your model has finally been trained.

It looks like we’re experiencing some out-of-memory (OOM) issues, and our team is actively working on a fix. Thank you for bringing this to our attention — we truly appreciate your feedback, as it helps us continuously improve the platform.

To compensate for the inconvenience, we’ve added $20 to your account balance.

Once again, I sincerely apologize for the disruption, and we appreciate your patience.

bluesky93128 added the bug label (Something isn't working) · Feb 2, 2025

UltralyticsAssistant added the HUB (Ultralytics HUB issues) and info needed (More information is required to proceed) labels · Feb 2, 2025
