
Migration failures #605

Closed · belforte opened this issue May 13, 2019 · 203 comments · May be fixed by #606
@belforte
Member

Sorry to add more problems.
There have been a few persistent failures to migrate datasets from Global to Phys03. While it surely makes sense that a migration can fail when Global can't be read, I'd like to see what exactly went wrong, if nothing else to know when to try again.

I looked at the DBS migration logs via vocms055, but they have no useful information, only a series of time stamps. Even when grepping for a given (failed) migration request id, I found only one line listing the id, but no detail.

Is there maybe a verbosity level that could be changed, temporarily?

Examples of datasets which failed to migrate:

/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM

/DYJetsToLL_M-50_TuneCUETHS1_13TeV-madgraphMLM-herwigpp/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3-v2/MINIAODSIM

@belforte
Member Author

I was too optimistic.
Migrations keep failing even now that DBS is working. How can we debug/solve this?
Example (I submitted this half an hour ago or so):

{'migration_status': 9, 'create_by': '[email protected]', 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'last_modified_by': '[email protected]', 'creation_date': 1557774029, 'retry_count': 3, 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#4a4e2ab5-1acc-4fae-9b5a-83c85ae7ffe4', 'migration_request_id': 2738818, 'last_modification_date': 1557775662}

@belforte
Member Author

And this is all that the logs I can find have to say about it:

belforte@vocms055/srv-logs> grep 2738818 */dbsmigration/dbsmigration-20190513.log
vocms0136/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:13:38 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:00:33 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:06:35 2019 Migration request ID: 2738818
vocms0165/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:22:41 2019 Migration request ID: 2738818

@yuyiguo
Member

yuyiguo commented May 13, 2019 via email

@yuyiguo
Member

yuyiguo commented May 13, 2019

It looks to me like it was a timeout.
I am trying to re-enable it.

@yuyiguo
Member

yuyiguo commented May 13, 2019

I re-enabled the migration. Let's see if it can be done this time.

@belforte
Member Author

Sorry @yuyiguo, I do not know what you mean by re-enable.
What I did this morning was to remove the existing failed migration request via the removeMigration API
and submit a new one, which eventually failed again.
Do you mean that I should submit it again, or that you did this already?
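
(For reference, that remove-and-resubmit cycle with the dbs3-client Python API looks roughly like the sketch below; the DBSMigrate instance URL and the argument shape of removeMigration are my assumptions, so treat it as a sketch rather than a verified recipe.)

from dbs.apis.dbsClient import DbsApi

# Destination migration service (phys03); URL pattern assumed from the
# migration_url fields quoted in this thread.
apiMig = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate')

failed_id = 2738818  # the terminally failed (status 9) request quoted above

# Drop the failed request, then submit a fresh one for the same block.
# (removeMigration is assumed to take a dict carrying the request id.)
apiMig.removeMigration({'migration_rqst_id': failed_id})

migration_data = {
    'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
    'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
                       'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/'
                       'AODSIM#4a4e2ab5-1acc-4fae-9b5a-83c85ae7ffe4',
}
print(apiMig.submitMigration(migration_data))  # the reply carries the new request id

# Later, poll it as in the IPython sessions below:
# status = apiMig.statusMigration(migration_rqst_id=<new id>)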

@belforte
Member Author

Tried another block, failed again:

In [60]: status = apiMig.statusMigration(migration_rqst_id=id)

In [61]: status[0]
Out[61]:
{'create_by': '[email protected]',
'creation_date': 1557801649,
'last_modification_date': 1557801952,
'last_modified_by': '[email protected]',
'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#07f3510e-6c05-4546-9d54-af9f646fc910',
'migration_request_id': 2738841,
'migration_status': 3,
'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
'retry_count': 0}

In [62]:

@belforte
Member Author

Looking at the ASO logs, all migrations are failing today :-(
I am checking past days now.

@belforte
Member Author

As I wrote to the users, DBS migration failures have been 'unheard of' until now, so we do not have a counter or anything. I have to grep log files, e.g. zgrep -i migrat stderr.log-20190506.gz|grep "Migration id"|cut -d: -f 9|grep terminally . Sorry that I do not paste here a way for others to reproduce, but here is a per-day count of migration requests that ended in a given status; I only count terminal statuses.
I only have logs since Apr 24.

day       success    fail
Apr 24     73         0
Apr 25    326         0
Apr 26   1280         0
Apr 27     38         0
Apr 28     22         0
Apr 29    139         0
Apr 30    260         1
May  1    191         2
May  2     86         3
May  3    179         4
May  4    201         8
May  5     40         3
May  6    218         3
May  7     94         2
May  8      0        40
May  9      3       392
May 10      0       413
May 11      0       840
May 12      0       326
May 13      0       352
May 14      1       297
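
(A minimal sketch of that counting, mirroring the shell pipeline above; the Publisher log format is not documented here, so the field position and the 'terminally' marker come from the pipeline and everything else is an assumption.)

import glob
import gzip
import re
from collections import Counter

terminal_failures = Counter()

for path in glob.glob('stderr.log-*.gz'):
    day = re.search(r'(\d{8})', path).group(1)      # e.g. 20190506, from the file name
    with gzip.open(path, 'rt', errors='replace') as log:
        for line in log:
            if 'Migration id' not in line:
                continue
            fields = line.split(':')
            if len(fields) >= 9 and 'terminally' in fields[8]:   # same field as cut -d: -f 9
                terminal_failures[day] += 1
            # successes would be tallied here as well, but the marker used for
            # them in the Publisher log is not shown in this thread

for day in sorted(terminal_failures):
    print(day, terminal_failures[day])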

@belforte
Member Author

I have submitted a new migration request after the rollback, let's see:

In [68]: status = apiMig.statusMigration(migration_rqst_id=id)

In [69]: status[0]
Out[69]:
{'create_by': '[email protected]',
'creation_date': 1557837005,
'last_modification_date': 1557837005,
'last_modified_by': '[email protected]',
'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5',
'migration_request_id': 2738882,
'migration_status': 0,
'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
'retry_count': None}

@yuyiguo
Member

yuyiguo commented May 14, 2019

Sorry Stefano, I did not explain what I meant by "re-enabled": I changed the migration status from 9 to 0 in the database, so that the migration request can be reprocessed by the migration server. 9 means permanently failed.
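
(For later readers, a small helper capturing the migration_status values as they surface in this thread: 0 for a freshly submitted request, 3 for a failed attempt that will still be retried, 9 for a permanent failure after the retries are exhausted. The intermediate values 1 and 2 for "in progress" and "completed" are my assumption of the usual DBS convention, so double-check against the server code.)

# Sketch: interpret the dicts returned by apiMig.statusMigration(...)
MIGRATION_STATUS = {
    0: 'pending (just submitted)',
    1: 'in progress (assumed)',
    2: 'completed (assumed)',
    3: 'failed, will be retried',
    9: 'terminally failed (no more retries)',
}

def describe(status_dict):
    code = status_dict['migration_status']
    return '%s, retry_count=%s: %s' % (
        MIGRATION_STATUS.get(code, 'unknown status %s' % code),
        status_dict.get('retry_count'),
        status_dict['migration_input'],
    )

# e.g. describe(status[0]) on the outputs pasted in this thread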

@belforte
Member Author

OK. Not something I have an API for; anyhow it should be the same as the migration delete plus a new request, which is what I did.
Alas, the new request which I posted earlier failed:
In [73]: status[0]
Out[73]:
{'create_by': '[email protected]',
'creation_date': 1557837005,
'last_modification_date': 1557839875,
'last_modified_by': '[email protected]',
'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5',
'migration_request_id': 2738882,
'migration_status': 3,
'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
'retry_count': 1}

We can wait for the 3 retries, but it does not look promising.
Back to: are there logs to check anywhere? Maybe I should look at Phys03Writer?

@yuyiguo
Member

yuyiguo commented May 14, 2019

I re-enabled yesterday's migration again. The problem is that the DBS global reader is under high load. When we migrate from global to phys03, the migration server uses the DBS API to read the block from global just like any other client. If the block is big, it will time out.

@yuyiguo
Member

yuyiguo commented May 14, 2019

It failed again.

@belforte
Member Author

I can understand that, but I would like to see confirmation of this timeout in the logs. I can look at the block size, but maybe we also need to include the parents. OTOH the load should not be very high now, should it?

@yuyiguo
Member

yuyiguo commented May 14, 2019

The thing we need to figure out is which blockdump call in the dbsGlobalReader logs comes from the phys03 migration server. Let me check.

@yuyiguo
Member

yuyiguo commented May 14, 2019

I am working on it. Please don't restart or delete these two migrations.

@belforte
Member Author

OK. thanks.

@belforte
Member Author

Let me know if I can help.

@yuyiguo
Member

yuyiguo commented May 14, 2019

INFO:cherrypy.access:[14/May/2019:16:12:06] vocms0163.cern.ch 2001:1458:201:a8::100:1a9 "GET /dbs/prod/global/DBSReader/blockdump?block_name=%2FHIMinimumBias9%2FHIRun2018A-v1%2FRAW%232ad44aae-509b-4554-be3f-6802f1d66056 HTTP/1.1" 500 Internal Server Error [data: - in 135 out 200100579 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "PycURL/7.19.3 libcurl/7.35.0 OpenSSL/1.0.1r zlib/1.2.8 c-ares/1.10.0" ]

Is this in the time frame of what you were doing?
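
(One way to pick out the suspicious calls from access-log lines like the one above: filter the blockdump requests and look at the status code and the large number before "us", here 200100579, i.e. roughly 200 seconds, which looks like the request duration in microseconds; that reading of the log format is an assumption.)

import re
import sys

# Feed this script lines from the DBSGlobalReader access log, e.g.
#   grep blockdump dbsglobal-access.log | python slow_blockdump.py
# (the log file and script names are hypothetical)
PATTERN = re.compile(
    r'blockdump\?block_name=(?P<block>\S+) HTTP/1\.1" (?P<status>\d{3}).*?'
    r'out (?P<micros>\d+) us'
)

for line in sys.stdin:
    m = PATTERN.search(line)
    if not m:
        continue
    seconds = int(m.group('micros')) / 1e6
    if m.group('status') != '200' or seconds > 100:
        print('%.1f s, HTTP %s, %s' % (seconds, m.group('status'), m.group('block')))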

@belforte
Member Author

No, this must be a test from Lina.

@belforte
Member Author

Ah sorry... ignore that, I misread the question.

@belforte
Member Author

The last example I posted here has
'create_by': '[email protected]',
'creation_date': 1557837005,
and that date is: Tue May 14 14:30:05 CEST 2019

There must be a place in hell for those who write timestamps in logs without a timezone.
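
(The DBS creation_date/last_modification_date fields appear to be plain Unix timestamps; a minimal sketch of converting them unambiguously, which reproduces the CEST time quoted above.)

from datetime import datetime, timezone
from zoneinfo import ZoneInfo   # Python 3.9+; earlier versions can use pytz instead

creation_date = 1557837005      # from the status dict above

utc = datetime.fromtimestamp(creation_date, tz=timezone.utc)
cern = utc.astimezone(ZoneInfo('Europe/Zurich'))
print(utc.isoformat())    # 2019-05-14T12:30:05+00:00
print(cern.isoformat())   # 2019-05-14T14:30:05+02:00, i.e. Tue May 14 14:30:05 CEST 2019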

@yuyiguo
Member

yuyiguo commented May 14, 2019

The 500 error may be related to the redeployment script. See here: https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183

@belforte
Member Author

I submitted the migration request well after DBS was restarted this morning.

@belforte
Member Author

Migration status from the ASO Publisher today:
0 successes, 129 failures. There were 31 failures 2 hours ago.

@belforte
Member Author

OK, I now understand your point about https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183. Let's see how it goes once that is solved. Thanks.

@yuyiguo
Member

yuyiguo commented May 14, 2019

Yes, let's wait for Lina to restart DBS with the new configuration. Then I will re-enable one of the migrations and see if we can get it going.

@yuyiguo
Member

yuyiguo commented May 14, 2019

It did not look good, see below. The migration server got a timeout while downloading the block from the source (DBSGlobalReader). This is after the last restart at 15:00 GMT.

Tue May 14 15:47:32 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 15:47:37 2019 Migration request ID: 2738882

Tue May 14 15:57:43 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 15:57:48 2019 Migration request ID: 2738882

Tue May 14 16:07:56 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 16:08:01 2019 Migration request ID: 2738882

Tue May 14 16:18:07 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 16:18:12 2019 Migration request ID: 2738882

@yuyiguo
Member

yuyiguo commented May 17, 2019

Yes, Alan.

@amaltaro
Contributor

Here it is: cms-sw/cmsdist#4983 (untested :D)

@yuyiguo
Member

yuyiguo commented May 17, 2019

It looks good to me. Thanks Alan.
@h4d4 Lina, could you please build and deploy this on testbed? I will test it.

@yuyiguo
Member

yuyiguo commented May 17, 2019

@bbockelm @amaltaro
Regarding #2, the DBS-specific fix suggested in #606: we are going to fix it in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123, right?

Fixing individual DBS APIs may soon see the same problem pop up in other places. The heaviest DBS APIs make more than one SQL call and depend on the previous result, and DBS is the heaviest user here. If a fix works for DBS, it should work for others too.

@h4d4

h4d4 commented May 17, 2019

@yuyiguo Yuyi, are you referring to cms-sw/cmsdist#4983?

@h4d4

h4d4 commented May 17, 2019

@yuyiguo @amaltaro Sure, I'm going to prepare the PR and put it on testbed today.
I'm going to include it in HG1905.

@amaltaro
Contributor

As a follow-up to this three-digit-long GH thread, I created a WMCore issue to be addressed at the beginning of next week: dmwm/WMCore#9207
If you have further suggestions/concerns, please make them there. But let's avoid going beyond 100 comments again, if possible :)

@belforte
Member Author

belforte commented May 17, 2019 via email

@h4d4

h4d4 commented May 17, 2019

@yuyiguo Yuyi,

I've built tag HG1905j including cms-sw/cmsdist#4983, and I've deployed it on testbed.
Tag history changes for the validation results are here.

I'll keep an eye on the validation results, in order to coordinate when it should go to production if the tests are successful.

Best Regards, Lina.

@yuyiguo
Member

yuyiguo commented May 17, 2019 via email

@belforte
Member Author

Thanks all. I just did a quick check of blockDump on cmsweb-testbed (which was timing out yesterday) and it works in a couple of seconds now :-)

@yuyiguo
Member

yuyiguo commented May 19, 2019

@h4d4
Lina,

I did the validation tests. Everything is good on cmsweb-testbed except migration.

The problem is that for testing the migration I need a source DBS and a destination DBS. In the cmsweb-testbed testing, the testbed is the destination DBS. In the regular case the source would be DBS global, but DBS global does not work for blockdump due to the SQLAlchemy version. So I switched to my VM as the source DBS, and I got errors that the migration server could not build the migration list. After some debugging, with a few restarts of the migration server on cmsweb-testbed with some prints added, I found the reason was "Failed to connect to dbs3-test2.cern.ch port 8443: No route to host". That kind of surprised me; I know my VM works when I do everything on the node. So I did a simple test by logging in to lxplus and curling to the DBS server on my VM, and I got the same connection error as the migration server on cmsweb-testbed.

I am not sure whether this is a feature of the new VMs or of cmsweb. I recall that in the past I was able to access DBS servers on personal VMs as long as I was inside CERN. I then tried to open ports 8443/443, but failed. However, I did test blockdump on cmsweb-testbed and it worked well, and blockdump is what caused the past migration failures.

At this point I don't know what we need to do about the deployment on prod. If we want to continue with full validation, @vkuznet Valentin or others may help with the port-opening issue, and then I will test again. If the current test is enough, you may deploy on production.
Thanks,
Yuyi

@belforte
Member Author

Well, if we deploy on production and migration still fails, we are no worse off than we are now. We may as well try.

@amaltaro
Contributor

@yuyiguo how about triggering a migration from testbed to prod (given that the failure happens on the reader side)? A rough sketch of such a request is at the end of this comment. If you want, I have a few datasets available only in testbed, e.g.:
[u'/RelValProdMinBias/DMWM_Test-ProdMinBias_TaskChain_ProdMinBias_Agent122_Validation_Privv2-v11/GEN-SIM',
u'/RelValProdMinBias/DMWM_Test-DIGIPROD1_TaskChain_ProdMinBias_Agent122_Validation_Privv2-v11/GEN-SIM-RAW',
u'/RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_Agent122_Validation_Privv2-v11/AODSIM',
u'/RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_Agent122_Validation_Privv2-v11/GEN-SIM-RECO']

Otherwise, I'm sending you (just did) a trick via private email that should allow you to connect to services running on your VM. Don't forget you need to allow your DN in the authmap.json file (check your crontab entry to see what I'm talking about).
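
(The testbed-to-prod request sketched with the dbs3-client, untested; the testbed DBSReader URL below is an assumed instance name, so adjust it to whatever the testbed actually exposes.)

from dbs.apis.dbsClient import DbsApi

# Destination: the production phys03 migration service (where the block should land).
apiMig = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate')

# Source: the testbed reader, which now runs the patched blockdump.
migration_data = {
    'migration_url': 'https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader',
    'migration_input': '/RelValProdMinBias/DMWM_Test-ProdMinBias_TaskChain_ProdMinBias_Agent122_Validation_Privv2-v11/GEN-SIM',
}
print(apiMig.submitMigration(migration_data))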

@h4d4

h4d4 commented May 20, 2019

Reading the latest inputs, my understanding is that additional tests are still needed. I'll keep an eye on it.

@vkuznet
Contributor

vkuznet commented May 20, 2019 via email

@yuyiguo
Member

yuyiguo commented May 20, 2019

@h4d4
Lina and All,
All tests were successful. Please deploy DBS 3.7.8 with HG1902X for production.
Thanks to Valentin and Alan for the help with the ports.
Yuyi

@h4d4

h4d4 commented May 20, 2019

@yuyiguo Yuyi, All,

Do you all agree with scheduling this intervention for early tomorrow morning, or are you suggesting we do it ASAP?

I need to announce it on hn-cms-cerncompannounce.

Best Regards, Lina.

@yuyiguo
Member

yuyiguo commented May 20, 2019 via email

@belforte
Member Author

Tomorrow morning will be OK; better not to make changes to production late in the day,
and one day more or less makes no difference.

Big thanks to all of you!

@h4d4

h4d4 commented May 20, 2019

@yuyiguo Yuyi, All,

Thanks for the feedback. I'll therefore run a production intervention on the DBS servers tomorrow at 9:00 AM Geneva time. I'm going to send the announcement to 'hn-cms-cerncompannounce'.

Best Regards, Lina.

@yuyiguo
Member

yuyiguo commented May 20, 2019 via email

@yuyiguo
Member

yuyiguo commented May 20, 2019

@h4d4
Lina,
We need to bring the DBS int DB up to the data level of prod. It will take one or two days to dump the data from prod to int, and the cmsweb-testbed DBS will not be available during that time. What is a good time to do this?
Cheers,
Yuyi

@h4d4

h4d4 commented May 20, 2019

@yuyiguo Yuyi,
Since HG1906 preproduction validation starts tomorrow, DBS testbed outages could affect validation of services such as DAS. @vkuznet
I see two possibilities:

  • Start the procedure now (if it is really urgent).
  • Do it once HG1906 preproduction validation ends, on 3rd June.

Best Regards, Lina.

@amaltaro
Contributor

Lina, I'm not sure I understand what you're proposing. We run the production intervention tomorrow morning; then, once it's done, you start working on the new testbed release.

Am I missing anything?

@belforte
Member Author

The question is: who is using the DBS Oracle INT instance in testbed?

@amaltaro
Contributor

amaltaro commented May 20, 2019

I suggest we discuss the prod --> int DB copy/dump in another thread, so as not to mix too many things here. And yes, the WMCore team is using the cmsweb-testbed int database almost daily.

@h4d4

h4d4 commented May 21, 2019

@yuyiguo @vkuznet @belforte @amaltaro
I've created this issue regarding the required procedure 'int DB dump'. Please comment there.
Best Regards, Lina.

yuyiguo closed this as completed Apr 30, 2020