Unable to resume my training #1009
Comments
👋 Hello @bluesky93128, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:
If this is a 🐛 Bug Report, thank you for providing some details. However, we will need a bit more information to investigate this issue thoroughly. Could you provide a Minimal Reproducible Example (MRE) that includes steps, code snippets, and any relevant details about your execution environment (e.g., platform, Python version, etc.)? You can find guidance on creating an MRE here to help us understand the problem better.
If this is a ❓ Question, please provide as much context as possible, including configurations, parameters, and intended outcomes, so we can assist more effectively.
We’ve flagged this issue for review, and an Ultralytics engineer will assist you as soon as possible. Thank you for your patience and for helping us improve the Ultralytics HUB! 😊
@bluesky93128 thank you for reporting this issue and including detailed error logs. Let's address the training resumption problem first:
For the precision drop, this could be related to the resumption issue. Once we resolve the checkpoint loading problem, I recommend:
If the issue persists after trying these steps, please share:
You can find more troubleshooting guidance in our HUB Training Documentation. Let's get you back on track with your training! 🚀
Actually, I was training this dataset using the Ultralytics GPU service.
@bluesky93128 It looks like the model was pointing to a checkpoint that was corrupted during upload. You should be able to resume training now.
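For context, the `UnpicklingError: invalid load key, '<'` in the log means the first byte of the downloaded `epoch-55.pt` was `<`, which is typically how an HTML or XML error page starts, not a PyTorch checkpoint. Below is a minimal sketch, using only the standard library, for checking a downloaded file's leading bytes before trying to load it. `sniff_checkpoint` is a hypothetical helper, not part of Ultralytics or hub-sdk, and the byte signatures are assumptions about how `torch.save` usually writes files (zip archive by default, pickle stream in the legacy format).

```python
from pathlib import Path

# Assumed signatures: torch.save's default format is a zip archive,
# and the legacy format starts with a pickle protocol opcode.
ZIP_MAGIC = b"PK\x03\x04"
PICKLE_PROTO = b"\x80"


def sniff_checkpoint(path):
    """Classify a file by its first bytes.

    Returns one of: 'zip', 'legacy-pickle', 'markup', 'empty', 'unknown'.
    'markup' means the file starts with '<', e.g. an HTML/XML error page.
    """
    with Path(path).open("rb") as f:
        head = f.read(64)
    if not head:
        return "empty"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(PICKLE_PROTO):
        return "legacy-pickle"
    if head.lstrip().startswith(b"<"):
        return "markup"
    return "unknown"
```

Calling `sniff_checkpoint("weights/epoch-55.pt")` on the file from the log would be one way to confirm, before loading it, whether the download returned an error page instead of weights. It only inspects the header, so a `'zip'` result does not prove the checkpoint is intact.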
@sergiuwaxmann |
NOTHING IS WORKING!!! I SPENT ALMOST $50 ON THIS TRAINING. IT'S RIDICULOUS! I OPENED 3 BUG TICKETS, BUT NOTHING WAS FIXED AND IT'S STILL BUGGING. PLEASE FIX IT AND MAKE IT WORK!!!
@bluesky93128 I apologize for the inconvenience. It looks like the training runs out of memory, causing a crash. I see that you have an ongoing training session now; hopefully, it works this time. Please let me know if it crashes again.
Still not working, please make it work. Here's my model ID: azkttWDuYNBz86w0nkxk
@bluesky93128 I see that your model has finally been trained. It looks like we’re experiencing some out-of-memory (OOM) issues, and our team is actively working on a fix. Thank you for bringing this to our attention; we truly appreciate your feedback, as it helps us continuously improve the platform. To compensate for the inconvenience, we’ve added $20 to your account balance. Once again, I sincerely apologize for the disruption, and we appreciate your patience.
Search before asking
HUB Component
No response
Bug
I'm unable to resume my training at 99
Environment
`requirements: Ultralytics requirement ['hub-sdk>=0.0.12'] not found, attempting AutoUpdate...
Collecting hub-sdk>=0.0.12
Downloading hub_sdk-0.0.18-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from hub-sdk>=0.0.12) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->hub-sdk>=0.0.12) (2024.12.14)
Downloading hub_sdk-0.0.18-py3-none-any.whl (42 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.9/42.9 kB 79.4 MB/s eta 0:00:00
Installing collected packages: hub-sdk
Successfully installed hub-sdk-0.0.18
requirements: AutoUpdate success ✅ 4.5s, installed 1 package: ['hub-sdk>=0.0.12']
requirements: ⚠️ Restart runtime or rerun command for updates to take effect
requirements:
Ultralytics HUB: New authentication successful ✅
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/5PTzkSPQ8xUbTvODkgyS8oYjAA12/models/azkttWDuYNBz86w0nkxk/epoch-55.pt...
Ultralytics HUB: View model at https://hub.ultralytics.com/models/azkttWDuYNBz86w0nkxk 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/5PTzkSPQ8xUbTvODkgyS8oYjAA12/models/azkttWDuYNBz86w0nkxk/epoch-55.pt to 'weights/epoch-55.pt'...
2025-02-02 21:14:34,719 - hub_sdk.helpers.logger - ERROR - Internal server error.
ERROR:hub_sdk.helpers.logger:Internal server error.
2025-02-02 21:14:34,725 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
ERROR:hub_sdk.helpers.logger:Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-10 (_start_heartbeats):
Traceback (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/usr/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/hub_sdk/base/server_clients.py", line 158, in _start_heartbeats
raise e
File "/usr/local/lib/python3.11/dist-packages/hub_sdk/base/server_clients.py", line 146, in _start_heartbeats
res = self.post(endpoint, json=payload).json()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'json'
UnpicklingError Traceback (most recent call last)
in <cell line: 0>()
1 hub.login('61cbc03a08f32c5cc4fbdd65bd0990b902b34c4a4a')
2
----> 3 model = YOLO('https://hub.ultralytics.com/models/azkttWDuYNBz86w0nkxk')
4 results = model.train()
7 frames
/usr/local/lib/python3.11/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
1626 )
1627
-> 1628 magic_number = pickle_module.load(f, **pickle_load_args)
1629 if magic_number != MAGIC_NUMBER:
1630 raise RuntimeError("Invalid magic number; corrupt file?")
UnpicklingError: invalid load key, '<'.`
Minimal Reproducible Example
Tried on Google Colab, but got the error above.
Additional
No response