Blocks failing to be inserted into DBS - submit4 #10887

Closed
amaltaro opened this issue Nov 2, 2021 · 14 comments · Fixed by dmwm/DBSClient#36

Comments

@amaltaro
Contributor

amaltaro commented Nov 2, 2021

Impact of the bug
WMAgent

Describe the bug
Before shutting down cmsgwms-submit4, which is running 1.5.2 and is fully drained, I noticed that there are a handful of blocks failing to be inserted into DBS:

{'/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAOD-106X_mc2017_realistic_v6-v3/MINIAODSIM',
 '/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM',
 '/QCD_Pt_600to800_TuneCP5_13TeV_pythia8/RunIISummer20UL18NanoAODv9-20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/NANOAODSIM',
 '/ST_t-channel_antitop_4f_hdampup_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8/RunIISummer20UL17NanoAODv2-106X_mc2017_realistic_v8-v1/NANOAODSIM',
 '/ST_t-channel_top_4f_hdampdown_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8/RunIISummer20UL17NanoAODv2-106X_mc2017_realistic_v8-v1/NANOAODSIM',
 '/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16NanoAODv9-20UL16JMENano_106X_mcRun2_asymptotic_v17-v1/NANOAODSIM'}

The error reported by DBS3Upload is 400 Bad Request.
It is not clear whether the same issue is also happening in other agents. We should verify that.

How to reproduce it
Not sure

Expected behavior
It would be helpful to know further details of why the bulk insert call fails. Besides that, we should of course make sure that the block dump information is correct, and if so, it should be successfully inserted into DBS Server.
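One possible way to get more detail than the bare status line is to read the response body attached to the HTTP error. The sketch below uses only the Python standard library; `describe_http_error` is a hypothetical helper name, and it assumes the failure surfaces as a `urllib.error.HTTPError`, which may not be how dbs3-client actually reports it. The URL and body text in the example are invented for illustration.

```python
import email.message
import io
import urllib.error


def describe_http_error(err):
    """Return status, reason and any response body carried by an HTTPError."""
    body = ""
    if err.fp is not None:
        # The body can only be read once; it often holds the server's explanation.
        body = err.read().decode("utf-8", errors="replace").strip()
    return "HTTP {} {}: {}".format(err.code, err.reason, body or "<empty body>")


# Simulated failure (assumption: URL and body are made up for this example).
example = urllib.error.HTTPError(
    url="https://example.invalid/bulkblocks",
    code=400,
    msg="Bad Request",
    hdrs=email.message.Message(),
    fp=io.BytesIO(b"Block X already exists"),
)
print(describe_http_error(example))
```

Logging a string like this next to the block name would show whether the 400 hides a more specific server-side message.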

Additional context and error message
A dump of one of those blocks can be found at: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_10887/dbsuploader_block.json

@amaltaro
Contributor Author

amaltaro commented Nov 2, 2021

@yuyiguo Yuyi, is there an easy way to find out why these blocks are being rejected by the server (or maybe even by the dbs3-client)? I see something strange in that JSON dump:

      "output_module_label": "Merged",

vs

      "output_module_label": "AODSIMoutput",

Could this be the reason why this block fails to be inserted? (I did not check whether the other blocks have the same issue, though.)
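To check the other block dumps for the same mix of labels, a small script along these lines could list every distinct `output_module_label` in a dump. This is only a sketch: `collect_values` is a hypothetical helper, it walks the JSON recursively so it makes no assumption about the real layout of the dump, and the sample document below is a made-up stand-in rather than the actual file.

```python
import json


def collect_values(node, key):
    """Recursively gather every string stored under `key` in nested JSON data."""
    found = set()
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.add(value)
            found |= collect_values(value, key)
    elif isinstance(node, list):
        for item in node:
            found |= collect_values(item, key)
    return found


# Tiny stand-in for a block dump; the surrounding keys are invented.
sample = json.loads("""
{
  "section_a": [{"output_module_label": "Merged"}],
  "section_b": [{"output_module_label": "AODSIMoutput"},
                {"output_module_label": "Merged"}]
}
""")
print(sorted(collect_values(sample, "output_module_label")))
```

For a real dump, `json.load(open(path))` would replace the inline sample.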

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

@amaltaro ,

Which DBS server were these blocks inserted into?

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

"Merged" is in multiple configs in prod.

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

@amaltaro

I did not see any problems with the data. I was able to upload your example block to the int DB.

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

@amaltaro

I need more info from you in order to debug the problem:

  1. Which server did you upload to? cmsweb-prod?
  2. What was the time frame in which you tried to upload? I need to check the logs.
  3. Can you retry the upload?

@muhammadimranfarooqi
Can you point me to where the logs for cmsweb-prod are? On vocms0750, I only found the k8s server logs. Where can I find the combined logs?

@muhammadimranfarooqi

Hi @yuyiguo

The logs are in /cephfs/cmsweb-prod/frontend-logs location in vocms0750

@amaltaro
Contributor Author

amaltaro commented Nov 2, 2021

@yuyiguo thanks for the prompt feedback. I understand that the JSON construction is okay then, and that the Merged output module is also correct and expected on the server side, right?

Answering your questions:

  1. This is the server we are using to write data to: https://cmsweb-prod.cern.ch/dbs/prod/global/DBSWriter
  2. This block:
/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d

has actually been failing to be inserted into DBS since 15/Oct (!!!). The same operation fails over and over, every few minutes (sometimes every hour or two). Here are the last 2 timestamps from the log (local time, thus FNAL time):

2021-11-02 14:01:31,814:140502769035008:ERROR:DBSUploadPoller:Error trying to process block /QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d through DBS. Error: HTTP Error 400: Bad Request
2021-11-02 14:03:21,606:140502769035008:ERROR:DBSUploadPoller:Error trying to process block /QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d through DBS. Error: HTTP Error 400: Bad Request
  3. The component is still trying to insert that block, and it keeps failing. This makes me wonder whether a different error is now being propagated for data already inserted into the database(?)
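To see how many blocks are stuck in this retry loop and with which error, the component log could be summarized with a short script. The following is a rough sketch: the regular expression is modeled only on the two log entries quoted above and assumes each entry sits on a single line, and `count_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern modeled on the DBSUploadPoller entries quoted above (an assumption
# about the general log format, not taken from the WMCore source).
FAILURE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+:\d+:ERROR:DBSUploadPoller:"
    r"Error trying to process block (?P<block>\S+) through DBS\. "
    r"Error: (?P<error>.*)$"
)


def count_failures(lines):
    """Count failed upload attempts per (block name, error message) pair."""
    counts = Counter()
    for line in lines:
        match = FAILURE.match(line.strip())
        if match:
            counts[(match.group("block"), match.group("error"))] += 1
    return counts


BLOCK = ("/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/"
         "RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/"
         "AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d")
log_lines = [
    "2021-11-02 14:01:31,814:140502769035008:ERROR:DBSUploadPoller:"
    "Error trying to process block " + BLOCK
    + " through DBS. Error: HTTP Error 400: Bad Request",
    "2021-11-02 14:03:21,606:140502769035008:ERROR:DBSUploadPoller:"
    "Error trying to process block " + BLOCK
    + " through DBS. Error: HTTP Error 400: Bad Request",
]
for (block, error), attempts in count_failures(log_lines).items():
    print(attempts, error, block)
```

Running the same function over the full component log on each agent would also help answer whether other agents are affected.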

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

The logs are in /cephfs/cmsweb-prod/frontend-logs location in vocms0750

@muhammadimranfarooqi thanks for the quick reply. I am looking for the logs of the DBS server on the VMs, not the frontend.

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

@muhammadimranfarooqi I found them.

@muhammadimranfarooqi

@yuyiguo

You can find logs in vocms055 at /build/srv-logs location

@yuyiguo
Member

yuyiguo commented Nov 2, 2021

@amaltaro
See the error from the log files and the DB query result:

ERROR:dbs.web.DBSWriterModel:Tue Nov 2 00:02:23 2021 dbsException-invalid-input2: DBSBlockInsert/insertBlock. Block /QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d already exists.

select * from blocks where block_name
 ='/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d';

25293672 /QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d 14258678 0 T2_US_Nebraska 21246241027 4 1634318499 [email protected] 1634318499 [email protected]

The block was inserted into DBS Friday, October 15, 2021 12:21:39 PM GMT-05:00 DST.
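For reference, the insertion time can be derived from the epoch value 1634318499 that appears in the query result. A minimal sketch using only the standard library, with a fixed GMT-05:00 offset assumed to match the time zone quoted above:

```python
from datetime import datetime, timedelta, timezone

block_epoch = 1634318499  # epoch seconds taken from the query result above
gmt_minus_5 = timezone(timedelta(hours=-5))  # fixed offset, an assumption

utc_time = datetime.fromtimestamp(block_epoch, tz=timezone.utc)
local_time = datetime.fromtimestamp(block_epoch, tz=gmt_minus_5)

print(utc_time.isoformat())    # 2021-10-15T17:21:39+00:00
print(local_time.isoformat())  # 2021-10-15T12:21:39-05:00
```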

@amaltaro
Contributor Author

amaltaro commented Nov 2, 2021

Haaa, my guess was correct then. I do not think there were any changes on the DBS Server side over the last 6 months or more, right? The component code expects this error:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L89

but it is no longer returned. This likely means that the error report is no longer properly handled, either in WMCore or in the dbs3-client itself. Anyhow, I should now have all the necessary information to properly debug this and provide a fix. Thanks again, Yuyi!
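The mismatch can be illustrated with a small sketch. It is not the actual WMCore code: `classify_upload_failure` is a hypothetical function, and the only assumption is that the component decides what to do by looking for the "already exists" text in the error it receives. The two messages are the ones quoted earlier in this thread.

```python
def classify_upload_failure(block_name, error_text):
    """Treat a duplicate-block report as success; anything else as a failure."""
    if "Block {} already exists".format(block_name) in error_text:
        # The block is already in DBS, so there is nothing left to upload.
        return "uploaded"
    return "error"


BLOCK = ("/QCD_HT300to500_BGenFilter_TuneCP5_13TeV-madgraph-pythia8/"
         "RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v3/"
         "AODSIM#cef2e470-c73d-4022-bb02-b989e409b58d")

# Message seen in the DBS server log.
server_side = ("dbsException-invalid-input2: DBSBlockInsert/insertBlock. "
               "Block " + BLOCK + " already exists.")
# Message seen by the agent component.
client_side = "HTTP Error 400: Bad Request"

print(classify_upload_failure(BLOCK, server_side))  # uploaded
print(classify_upload_failure(BLOCK, client_side))  # error
```

With only the generic 400 message reaching the component, the duplicate is never recognized, which would explain the endless retries.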

@amaltaro
Contributor Author

amaltaro commented Nov 3, 2021

Issue identified and fix provided in: dmwm/DBS#660

And here is the command to patch dbs3-client in our agents:

curl https://patch-diff.githubusercontent.com/raw/dmwm/DBS/pull/660.patch | patch -d sw/slc7_amd64_gcc630/cms/py3-dbs3-client/3.1*/lib/python3.8/site-packages/ -p 4

followed by a restart of the DBS3Upload component:

$manage execute-agent wmcoreD --restart --components=DBS3Upload

@amaltaro
Contributor Author

amaltaro commented Nov 3, 2021

All the production and some of the testbed agents have been patched. This issue can be closed as soon as the DBS pull request gets merged.

3 participants