Migration failures #605
This refers to these two user problem reports; there may be more: |
I was too optimistic. {'migration_status': 9, 'create_by': '[email protected]', 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'last_modified_by': '[email protected]', 'creation_date': 1557774029, 'retry_count': 3, 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#4a4e2ab5-1acc-4fae-9b5a-83c85ae7ffe4', 'migration_request_id': 2738818, 'last_modification_date': 1557775662} |
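For context, a hedged sketch of how such a record can be retrieved and interpreted with the DBS3 Python client. The status-code meanings are an assumption inferred from this thread (9 = permanently failed after the retries are exhausted); the DBSMigrate URL and the request ID are taken from the record above.
```python
# Sketch: inspect a DBS migration request (assumes the DBS3 client is installed).
from dbs.apis.dbsClient import DbsApi

# DBSMigrate instance URL is an assumption; adjust for the target instance (e.g. phys03).
apiMig = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate')

# Assumed meanings of migration_status, inferred from this thread:
STATUS = {0: 'pending', 1: 'in progress', 2: 'completed',
          3: 'failed (will retry)', 9: 'permanently failed'}

status = apiMig.statusMigration(migration_rqst_id=2738818)  # ID from the record above
rec = status[0]
print(STATUS.get(rec['migration_status'], 'unknown'), 'after', rec['retry_count'], 'retries')
```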
and this is all that the logs I can find have to say about it:
belforte@vocms055/srv-logs> grep 2738818 */dbsmigration/dbsmigration-20190513.log
vocms0136/dbsmigration/dbsmigration-20190513.log:--------------------getResource-- Mon May 13 19:13:38 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource-- Mon May 13 19:00:33 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource-- Mon May 13 19:06:35 2019 Migration request ID: 2738818
vocms0165/dbsmigration/dbsmigration-20190513.log:--------------------getResource-- Mon May 13 19:22:41 2019 Migration request ID: 2738818 |
Let me check. |
It looked to me like it was a timeout. |
I re-enabled the migration. Let's see if it can be done this time. |
Sorry @yuyiguo, I do not know what you mean by re-enable. |
Tried another block, failed again:
In [60]: status = apiMig.statusMigration(migration_rqst_id=id)
In [61]: status[0] |
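A minimal polling sketch of the check being done interactively above; the terminal statuses (2 = completed, 9 = permanently failed) are an assumption drawn from this thread, and `apiMig` is the DBSMigrate client from the earlier sketch.
```python
import time

def wait_for_migration(apiMig, rqst_id, poll=60):
    """Poll a migration request until it reaches a terminal state (assumed 2 or 9)."""
    while True:
        rec = apiMig.statusMigration(migration_rqst_id=rqst_id)[0]
        if rec['migration_status'] in (2, 9):
            return rec
        time.sleep(poll)  # the server itself retries failed requests up to 3 times
```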
Looking at ASO logs, all migrations are failing today :-( |
As I wrote to the users, DBS migration failures have been 'unheard of' forever, so we do not have a counter or anything; I have to grep log files like the one above.
|
I have submitted a new migration request after the rollback, let's see:
In [68]: status = apiMig.statusMigration(migration_rqst_id=id)
In [69]: status[0] |
Sorry Stefano, I did not explain what I meant by "re-enabled": I changed the migration status from 9 to 0 in the database, so that the migration request can be reprocessed by the migration server. Status 9 means permanently failed. |
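Since there is no API for this (as noted in the next comment), the re-enable is a direct database update. A hedged sketch of what that operation looks like against the Oracle backend; the table and column names are assumptions inferred from the JSON keys of the migration record above, not a confirmed DBS schema.
```python
# Sketch only: re-enable a permanently failed migration (status 9 -> 0).
# Table/column names are ASSUMED from the record's JSON keys; verify against the real schema.
import cx_Oracle

conn = cx_Oracle.connect('user/password@dbs_prod_dsn')  # placeholder credentials/DSN
cur = conn.cursor()
cur.execute("""UPDATE migration_requests
               SET migration_status = 0, retry_count = 0
               WHERE migration_request_id = :rid""", {'rid': 2738818})
conn.commit()
```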
OK. Not something I have an API for; anyhow it should be the same as the migration delete plus a new request, which I did. We can wait for the 3 retries, but it does not look promising. |
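For reference, a hedged sketch of the "migration delete + a new request" route mentioned here, using the DBS3 client (`apiMig` from the earlier sketch); the payload keys follow the record shown at the top of the thread, but treat the exact call signatures as assumptions to verify against dbsClient.py.
```python
# Sketch: remove a failed migration request and submit a fresh one.
apiMig.removeMigration({'migration_rqst_id': 2738818})

new_req = apiMig.submitMigration({
    'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',  # source DBS
    'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
                       'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM',
})
```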
I re-enabled yesterday's migration again. The problem is that the DBS global reader is under high load. When we migrate from global to phys03, the migration server uses the DBS API to read the block from global just like anyone else. If the block is big, it will time out. |
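To make the failure mode concrete: the migration server's read amounts to a blockdump call like the one in the access log further below. A hedged sketch of that call with an explicit client-side timeout; the certificate paths are placeholders and the timeout value is illustrative.
```python
# Sketch: the kind of read the migration server performs against the source DBS.
import requests

url = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blockdump'
block = '/HIMinimumBias9/HIRun2018A-v1/RAW#2ad44aae-509b-4554-be3f-6802f1d66056'

resp = requests.get(url,
                    params={'block_name': block},
                    cert=('/path/to/usercert.pem', '/path/to/userkey.pem'),  # placeholders
                    verify='/etc/grid-security/certificates',
                    timeout=300)  # illustrative; a big block can exceed this and time out
resp.raise_for_status()
dump = resp.json()
```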
It failed again. |
I can understand that, but I would like to see confirmation of this timeout in the logs. I can look at the block size, but maybe we need to also include parents. OTOH the load should not be very high now, should it? |
The thing we need to figure out is which blockdump call in the dbsGlobalReader logs is from the phys03 migration server. Let me check. |
I am working on it. Please don't restart or delete these two migrations. |
OK. thanks. |
let me know if I can help |
INFO:cherrypy.access:[14/May/2019:16:12:06] vocms0163.cern.ch 2001:1458:201:a8::100:1a9 "GET /dbs/prod/global/DBSReader/blockdump?block_name=%2FHIMinimumBias9%2FHIRun2018A-v1%2FRAW%232ad44aae-509b-4554-be3f-6802f1d66056 HTTP/1.1" 500 Internal Server Error [data: - in 135 out 200100579 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "PycURL/7.19.3 libcurl/7.35.0 OpenSSL/1.0.1r zlib/1.2.8 c-ares/1.10.0" ]
Is this in the time frame you were working in? |
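Note that the `out 200100579 us` field in that line means the request took roughly 200 seconds before returning the 500, consistent with a timeout. A small sketch for pulling that number out of cherrypy access-log lines; the format is taken from the line above.
```python
# Sketch: flag slow requests in cherrypy access logs (format as in the line above).
import re
import sys

TIMING = re.compile(r'\[data: \S+ in \S+ out (\d+) us \]')

for line in sys.stdin:
    m = TIMING.search(line)
    if m and int(m.group(1)) > 60_000_000:  # longer than 60 seconds
        print(int(m.group(1)) / 1e6, 's:', line.strip())
```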
No, this must be a test from Lina. |
ah sorry.... ignore. I misread the question. |
About the last example I posted here: there must be a place in hell for those who write timestamps in logs w/o a timezone |
the 500 error may be related to the redeployment script. See here: https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183 |
I submitted the migration request well after DBS was restarted this morning. |
Migration status from the ASO Publisher today: |
OK. I understand now your point about https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183 . Let's see how it goes once that is solved. Thanks |
Yes, let's wait for Lina to restart DBS with the new configuration. Then I will re-enable one of the migrations and see if we can get it going. |
It did not look good, see below. The migration server timed out while downloading the block from the source (DBSGlobalReader). This is after the last restart at 15:00 GMT.
|
Yes, Alan. |
Here it is: cms-sw/cmsdist#4983 (untested :D) |
It looks good to me. Thanks Alan. |
@bbockelm @amaltaro Fixing individual DBS APIs may soon lead to the same problem popping up in other places. The heaviest DBS APIs make more than one SQL call, each depending on the previous result. DBS is the heaviest user here. If a fix works for DBS, it should work for others too. |
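A generic illustration (not DBS code) of the pattern described: two SQL calls on one connection, where the second depends on the first — exactly where a connection invalidated or rolled back between calls will break. Table and column names are hypothetical.
```python
# Generic sketch of dependent sequential SQL calls; names are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine('oracle://user:password@dsn')  # placeholder DSN

with engine.connect() as conn:
    # First call: look up the block ID.
    row = conn.execute(text("SELECT block_id FROM blocks WHERE block_name = :b"),
                       {'b': '/some/block#uuid'}).fetchone()
    # Second call depends on the first result; if the connection was dropped or
    # rolled back between the two calls, this is where the failure surfaces.
    files = conn.execute(text("SELECT file_name FROM files WHERE block_id = :bid"),
                         {'bid': row[0]}).fetchall()
```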
@yuyiguo Yuyi, are you referring to cms-sw/cmsdist#4983? |
As a follow-up to this GH thread, now hundreds of comments long, I created this WMCore issue to be addressed at the beginning of next week: dmwm/WMCore#9207 |
+1. Sorry, bad morning for work. (Sent from Stefano's mobile.)
On May 17, 2019 09:06, Brian P Bockelman wrote: @amaltaro - I think the sequence should be:
1. Roll back sqlalchemy; get CRAB working ASAP.
2. DBS-specific fix as suggested in #606.
3. Larger discussion, preferably not through a million GitHub threads :)
|
@yuyiguo Yuyi, All, I've built tag HG1905j including cms-sw/cmsdist#4983, and I've deployed it on the testbed. I'll keep an eye on the validation results, in order to coordinate when it should go to production if tests are successful. Best Regards, Lina. |
Thanks all. I did just a quick check of blockDump on cmsweb-testbed (which was timing out yesterday) and it works in a couple of seconds now :-) |
@h4d4 I did validation tests. Everything is good on cmsweb-testbed except migration. The problem is that for testing migration, I need a source DBS and a destination DBS. In the cmsweb-testbed testing, the testbed is the destination DBS. In the regular case, the source DBS would be DBS global; however, DBS global does not work for blockdump due to the SQLAlchemy version. So I switched to my VM as the source DBS, and I got errors that the migration server could not build the migration list. After some debugging, with a few restarts of the migration server on cmsweb-testbed with some prints, I found the reason was "Failed to connect to dbs3-test2.cern.ch port 8443: No route to host". That was kind of surprising to me; I know my VM works when I do everything on the node. So I did a simple test by logging in to lxplus and curling the DBS server on my VM, and I got the same connection error as the migration server on cmsweb-testbed. I am not sure if this is a feature of new VMs or of cmsweb; I recall that in the past I was able to access DBS servers on personal VMs as long as I was inside CERN. I then tried to open the 8443/443 ports, but failed. However, I did test blockdump on cmsweb-testbed and it worked well, and blockdump is what caused the past migration failures. At this point, I don't know what we need to do with the deployment on prod. If we want to continue to a full validation, @vkuznet Valentin or others may help with the port-opening issue, and then I will test again. If the current test is enough, you may deploy to production. |
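A quick sketch of the connectivity check described here, reproducing the "No route to host" symptom without curl; the hostname and port are the ones from the error message.
```python
# Sketch: test whether port 8443 on the personal VM is reachable at all.
import socket

try:
    sock = socket.create_connection(('dbs3-test2.cern.ch', 8443), timeout=10)
    print('port 8443 reachable')
    sock.close()
except OSError as exc:  # e.g. "No route to host" when a firewall blocks the port
    print('connection failed:', exc)
```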
Well, if we deploy to production and migration still fails, we are no worse off than we are now. We may as well try. |
@yuyiguo how about triggering a migration from testbed to prod (given that the failure happens on the reader side)? If you want, I have a few datasets available only in testbed, e.g.: Otherwise, I'm sending you (just did) a trick via private email that should allow you to connect to services running on your VM. Don't forget you need to allow your DN in the authmap.json file (check your crontab entry to see what I'm talking about). |
Reading the latest inputs, my understanding is that there is still a need to make additional tests. I'll keep an eye on it. |
Yuyi,
I think the default behavior now for machine tools is to connect to port 8443.
We had a campaign about it a year ago, and you implemented this
change in your DBS client APIs; see
https://github.com/dmwm/DBS/blob/master/Client/src/python/dbs/apis/dbsClient.py#L140
So you have two options: either adjust your client to use port 443, in which case I
would expect everything to work, or add a new iptables rule to open port 8443 on
your VM:
sudo iptables -I INPUT -i eth0 -p tcp -s $ipaddr --dport $port -j ACCEPT
where you need to change ipaddr and port accordingly.
Then you may need to save and reload the rules:
sudo service iptables save
sudo service iptables reload
Please pay attention: the VM may be managed by puppet, which runs its own cron, and
the rules may be wiped out because puppet will put back whatever its configuration
says. In that case the rules should be changed in the puppet configuration.
Best,
Valentin.
|
@h4d4 |
@yuyiguo Yuyi, All, do you all agree on scheduling this intervention for tomorrow early in the morning? I need to announce it on hn-cms-cerncompannounce. Best Regards, Lina. |
Tomorrow morning will be OK; better not to make changes to production late in the day. Big thanks to all of you! |
@yuyiguo Yuyi, All, thanks for the feedback. I'll therefore run a production intervention on the DBS servers tomorrow at 9:00 AM GVA time, and I'm going to send the announcement to 'hn-cms-cerncompannounce'. Best Regards, Lina. |
@h4d4 |
@yuyiguo Yuyi,
Best Regards, Lina. |
Lina, I'm not sure I understand what you're proposing. We run the production intervention tomorrow morning; then, once it's done, you start working on the new testbed release. Am I missing anything? |
The question is: who is using the DBS Oracle INT instance in testbed? |
I suggest we discuss the prod --> int DB copy/dump in another thread, so as not to mix too many things here. And yes, the WMCore team is using the cmsweb-testbed int database almost daily. |
Sorry to add more problems.
There have been a few persistent failures to migrate datasets from Global to Phys03. While it surely makes sense that migration can fail when Global can't be read, I'd like to see what exactly went wrong, if nothing else to know when to try again.
I looked at DBS migration logs via vocms055, but they have no useful information, only a series of timestamps. Even when grepping for a given (failed) migration request ID, I found only one line listing the ID, but no detail.
Is there maybe a verbosity level that could be changed, temporarily?
examples of Dataset which failed to migrate:
/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM
/DYJetsToLL_M-50_TuneCUETHS1_13TeV-madgraphMLM-herwigpp/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3-v2/MINIAODSIM