Intermittent timeouts from MSA server #664
Comments
This may be a related issue. I am running ColabFold in my personal Google account. Here is an example of the log output: Downloading alphafold2_multimer_v3 weights to .: 100%|██████████| 3.82G/3.82G [00:29<00:00, 138MB/s] The above exception was the direct cause of the following exception: Traceback (most recent call last):
Can you please try --host-url "https://api-105.colabfold.com"? This is exactly the same MSA server; only the route to the server is different. Please report back whether this also results in issues.
Thank you for the help! I've tried running colabfold_batch with the above host-url; unfortunately, it seems like the same issue is still present (irregular timeouts, seemingly inconsistent even for the same sequence submission).
Thanks for sharing. I got a 404 error on https://api-105.colabfold.com. I'm guessing the Colab code needs a modification to use this server, but I could not find where to set it in the code. I'm not a programmer at all.
After running on "https://api-105.colabfold.com/" for a bit longer, I'm noticing that the timeout frequency is much higher (~90% of requests fail now, compared to what roughly seems like ~50% for the original server).
Thanks for sharing. I have been experiencing the same problem since last Friday and have still not been able to resolve it. If you have any solutions, I would appreciate it if you could let us know. 2024-11-21 08:57:23,690 Error while fetching result from MSA server. Retrying... (1/5)
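For anyone scripting around these timeouts, the "Retrying... (1/5)" behaviour in the log above can be approximated with a small wrapper. This is only a sketch: `fetch_with_retries`, its parameters, and the exponential backoff schedule are hypothetical names and choices of mine, not ColabFold's actual retry code.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def fetch_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is reached.

    Hypothetical helper: the delay doubles after every failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt == max_attempts:
                break
            logger.warning(
                "Error while fetching result. Retrying... (%d/%d)",
                attempt,
                max_attempts,
            )
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
```

Passing `sleep` in as a parameter makes it possible to test the wrapper without actually waiting.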
I tried to deploy another workaround. Please let me know if there are still issues with the default
After running a bit more on the default server, it seems to be a bit more consistent now, though I do still have intermittent timeouts/failures. I would say about 80-90% of requests succeed now using the default api.colabfold.com host.
I'm not sure this will help. The error below occurs only when I select PDB100 for template_mode. The problem is that this crashes every run I try to make. As such, I can't reproduce a PPI that was highly confident before whatever change happened circa Nov 11-20. IndexError Traceback (most recent call last) IndexError: list index out of range
The IT team didn't get back to us before the weekend. I hope we can do something about this issue on Monday.
As an update to this: the server seems to be timing out a lot less often now (no failures in about 10 attempts so far), though it can take a while for it to return results (over 10 minutes in one case). This is definitely preferable to the previous case where it would fail intermittently, though it does seem to be a bit slower than it used to be.
I deployed a new workaround on Sunday. I don't expect any failures anymore. The reduced speed is a bit surprising though. Is it the download speed or job throughput? I reduced the job "token recovery" to one new job token restored every 100 seconds, compared to 90 seconds before.
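To make the "token recovery" change concrete, here is a rough token-bucket sketch. Only the 100-second and 90-second intervals come from the comment above; the class name `JobTokenBucket`, the bucket capacity, and the exact refill logic are assumptions for illustration, not the server's real rate limiter.

```python
class JobTokenBucket:
    """Hypothetical per-user job limiter: one token is restored every refill_seconds."""

    def __init__(self, capacity: int, refill_seconds: float, now: float = 0.0):
        self.capacity = capacity            # assumed maximum number of stored tokens
        self.refill_seconds = refill_seconds
        self.tokens = capacity
        self.last_refill = now

    def _refill(self, now: float) -> None:
        if self.tokens >= self.capacity:
            # The refill timer only runs while at least one token is missing.
            self.last_refill = now
            return
        restored = int((now - self.last_refill) // self.refill_seconds)
        if restored > 0:
            self.tokens = min(self.capacity, self.tokens + restored)
            self.last_refill += restored * self.refill_seconds

    def try_submit(self, now: float) -> bool:
        """Consume a token if one is available at time `now` (in seconds)."""
        self._refill(now)
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


# With an empty bucket, sustained throughput is bounded by the refill interval:
# 3600 / 100 = 36 jobs per hour, versus 3600 / 90 = 40 jobs per hour before.
bucket = JobTokenBucket(capacity=2, refill_seconds=100.0)
print([bucket.try_submit(t) for t in (0, 1, 50, 100)])
```

Under these assumptions the change lowers the sustained rate by about 10%, which on its own would not explain multi-minute waits for a single job.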
Working now! Thanks
It seems like it's mostly the speed, though I'm not totally sure. Logs seem to imply that a query was sent and that the server is processing for quite a bit of time (though I'm not sure how to interpret the pending/running commands here). This takes longer than the job token being restored, though it may be a combination of factors. Example below: 2024-11-25 23:37:36,771 Running colabfold 1.5.5 --snipped out many "RUNNING" lines for brevity-- 2024-11-25 23:45:31,755 Sleeping for 9s. Reason: RUNNING
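For reference, the wall-clock time between the two log lines quoted above can be computed from their timestamps. This is a small sketch; the helper names are my own, and it measures the gap between the first and last line shown, which may include more than time spent in RUNNING.

```python
from datetime import datetime

# Timestamp layout used in the log lines above: "YYYY-MM-DD HH:MM:SS,mmm"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line: str) -> datetime:
    """Parse the leading timestamp (date and time fields) of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (parse_log_time(last_line) - parse_log_time(first_line)).total_seconds()


first = "2024-11-25 23:37:36,771 Running colabfold 1.5.5"
last = "2024-11-25 23:45:31,755 Sleeping for 9s. Reason: RUNNING"
print(f"{elapsed_seconds(first, last):.3f} s")  # prints "474.984 s"
```

So the query had been in flight for just under 8 minutes at the last line shown.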
Thanks for the eagle eyes @rachitk
Ah wait, this is a monomer. This shouldn't be affected. This is indeed surprisingly slow. I would have expected something on the order of ~1 minute to process this (total time spent in RUNNING).
Sorry for the delay in responding - was out for the holidays in the US. I've submitted a few new (monomer) jobs since the last comment here and it seems that none of them have failed and that most of the queries return within 2-3 minutes (usually <1 minute) now. Generally, the performance seems to be back to normal. Happy to run any additional tests if needed, but I think the underlying issues (intermittent failures and later slower responses) have largely been resolved from my tests. Thank you so much for all the help and for deploying a fix so quickly re: the intermittent failures!
Thank you so much again for making this resource available!
This is basically a duplicate of #646 but updated since the issues are much more intermittent.
Expected Behavior
Consistently receive MSA responses.
Current Behavior
Intermittent timeouts when trying to query the MSA server - sometimes retrying with the same sequence will work.
Steps to Reproduce (for bugs)
When running colabfold_batch, I sometimes get errors like the ones below; other runs with the same input complete without them.
ColabFold Output (for bugs)
Log
Context
curl https://api.colabfold.com/queue
returns
Your Environment
I am currently running ColabFold locally using the most recent docker container (1.5.5) (https://github.com/sokrypton/ColabFold/wiki/Running-ColabFold-in-Docker) on a cluster system running CentOS7, with a Tesla T4 GPU. The cluster does have access to the internet.