DBS API BulkBlock input size control #599
Yuyi,
I think you answered your own question: it is not normal that the frontend should get
stuck processing a 200MB request, and it is not normal that DBS should struggle either.
My feeling is that we should introduce throttling at both levels. The frontend throttling
will slow down overly frequent clients, while DBS throttling will weigh clients' usage patterns.
For Apache we can use mod_evasive or mod_throttle, while for the DBS backend
I did not find an explicit CherryPy solution and we should probably write our own.
But for Flask we can use
http://flask.pocoo.org/snippets/70/
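While the rate-throttling part would still need to be written, the plain request-size cap is something CherryPy itself already offers; a minimal sketch (not existing DBS code, and the 50 MB figure is an arbitrary example):
```python
# Reject oversized POST bodies at the CherryPy level with
# "413 Request Entity Too Large" before the handler ever runs.
import cherrypy

class Root(object):
    @cherrypy.expose
    def index(self):
        return "DBS stub"

cherrypy.config.update({
    'server.max_request_body_size': 50 * 1024 * 1024,  # cap bodies at 50 MB
})
cherrypy.quickstart(Root())
```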
In the past I already did all the work for mod_evasive, which is now part of the cmsdist
repository; we just need to revisit its specs and configuration.
Valentin.
…On 0, Yuyi Guo wrote:
@bbockelm @amaltaro @belforte @vkuznet and @ALL
The DBS database has grown with time and the DBS servers are getting more load. We no longer have the luxury of loading huge files. The most recent issue was a block with 500 files, about 200 MB in size, with 1,643,229 lumi sections. This block could not even be loaded in its entirety through the front end.
Now is the time to look into what limits DBS should impose. What are reasonable limits? Limits on block size, number of files, and number of lumi sections in a block?
Currently, WMAgents have a limit of 500 files per block in total, but file sizes vary a lot.
I am not sure what limit CRAB puts in.
About the last point: CRAB Publisher is currently configured for 100 files/block. There is also a limit of 100k on how many lumis can be in an input block. Since one job cannot cross block boundaries, this gives a maximum of 100k lumis in one output file if someone processes data which are "at the edge" and wants the output in DBS. Is this the time to question whether and how we want to store the lumi list for nanoAOD? So far the above limitation results in "can read nanoAOD with CRAB, but only with splitByFile". But I suspect nobody has tried to store in DBS the output of "nanoAOD skimming", which would result in even more lumis/file. In the end someone could have a search analysis producing one file with maybe 100 events, but all lumis in all CMS runs!
Valentin, protecting DBS from code which ran astray is good, but here we also need to define how we should use DBS so that it keeps working smoothly for us. Breaking large inputs into pieces may avoid FE timeouts, but do we really need to push those enormous JSON lists into Oracle?
Thanks for starting this discussion Yuyi. We also have to come up with better thresholds for the clients (aka CRAB and WMAgent). Imposing these limitations will always be sub-optimal though, and people need to be aware that it won't come for free (expect small blocks here and there). BTW, that JSON is very large because we post that info to DBS with the keys/binds already formatted for the DAO; that's why the volume is large (which saves quite some CPU cycles on the DBS server side).
Alan, which kind of dataset was that huge block for? I do not see how 500 files could be a problem, but 1.6 M lumis in ASCII-formatted JSON really sounds like a lot to digest. Why do we store the lumi list in a relational DB? Is it only for answering the question "give me the file(s) in this dataset which contain lumi X from run R"? I do not see that question as useful for highly compacted data tiers.
On this subject, how are lumis provided, as an array of ints? Can we use ranges of
lumis, which may significantly reduce the size of the uploaded document?
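As a rough illustration of the ranges idea (not code from DBS or CRAB), a lumi list can be collapsed into inclusive [first, last] pairs before shipping; how much this saves depends on how contiguous the lumis are:
```python
def to_ranges(lumis):
    """Collapse a list of lumi numbers into inclusive [first, last] ranges."""
    ranges = []
    for lumi in sorted(set(lumis)):
        if ranges and lumi == ranges[-1][1] + 1:
            ranges[-1][1] = lumi           # extend the current range
        else:
            ranges.append([lumi, lumi])    # start a new range
    return ranges

print(to_ranges([1, 2, 3, 7, 8, 10]))      # -> [[1, 3], [7, 8], [10, 10]]
```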
…On 0, Stefano Belforte wrote:
Alan, which kind of dataset was that huge block for ? I do not see how 500 files could be a problem, but 1.6 M lumis in ASCII formatted JSON really sounds a lot to digest. Why do we store lumilist in a relational DB ? Is it only for answering the question "give me the file(s) in this dataset which contain lumi X from run R" ? I do not see that question as useful for highly compacted data tiers.
CRAB uses the format from this example to fill the structure to be passed to `insertBulkBlock`
https://github.com/dmwm/DBS/blob/master/Client/tests/dbsclient_t/unittests/blockdump.dict
since we could not find any other documentation. In that, every file is a list of dictionaries, one of which is a list of {'run': int, 'lumi': int}. We have not touched that code since it was written early in DBS3 history.
I would distinguish three things here:
1. how to pass that list efficiently (ranges may not gain more than a few O(1) factors, since there are many gaps and lumis are scattered almost at random in the initial RAW files)
2. how to store that information (i.e. which kind of query and/or retrieval we want)
3. when to store it (i.e. for which files/datasets)
Thanks all for the discussion here. For the huge block of 200 files, you may find some details at https://its.cern.ch/jira/browse/CMSCOMPPR-5196. Regarding the input data format, the link Stefano pointed out is the current input requirement. We designed this format because we want the data to be inserted without reformatting in DBS. But if this format is the problem, we definitely can redesign it to reduce the input volume. However, we have the 300-second limit imposed on the server, and reformatting the data will increase the time and memory used in DBS. What do we want to trade here? Even if we reduce the data volume passed into DBS, inserting 1.6 million lumis into DBS is still a big challenge. So my idea is to find a balance.
Breaking the bulk-block insertion into multiple API calls was what DBS2 did. Everyone who experienced DBS2 knows what the problems were; I will not discuss them here. I do not think we are going back down that route unless we really want to redesign DBS for DBS4.
Proposal:
WMA and CRAB should set a limit at 10k lumis per block. Period.
And let's see what breaks, if anything, and if it really needs to be fixed
let's fix it in a different way than redesigning DBS.
I do not know about WMA, but CRAB will refuse to read the lumi list
for a block which has more than 100K lumis; I really do not see
the point in creating that monster.
Then looking at this long and confused thread (thanks Yuyi)
#599:
IIUC this crap originates from some GS (GenSim?) things where clearly
the #lumis/job must have been set completely wrong.
If that's the case the only cure is to detect it as early as possible
and send the crap back upstream as quickly as possible.
Trying to accommodate any silly request that comes our way is not good.
In https://its.cern.ch/jira/browse/CMSCOMPPR-5196 at some point
JR correctly points out that 3 ev/lumi is silly. Lumis in
Gen are ONLY there to allow processing of the output at a sub-file
level using split-by-lumi; 3 ev/lumi makes no sense and must
be rejected upfront.
Why did we process this anyhow?
Why do people do ACDC for GenSim?
We would be better off revisiting the lumi-in-Gen thing and
stopping putting lumis there, since differently from data those lumis
do not come from real life (# seconds) and there's no limit to how
many problems we can have from wrong configurations.
Why do we have to spend time debugging how to insert such a block in DBS?
We should not just push things around blindly until they somehow "go",
but find the core issue down at the root and solve things there.
Where's the DESIGN part here?
P.S. Yet I am glad this came about, because I already asked long ago
for a DBS-side defined limit on what it could handle, so that I could break
things up CRAB-side, but could not get an answer. Although I understand
that the first limit comes from the 5-min CMSWEB FE timeout, so large
requests may never reach the DBS BE. In the new CMSWEB architecture this may change,
but I would not like clients to keep connections that last minutes anyhow.
And clearly for a GenSim dataset there is absolutely no reason to be prepared to answer "give me the file which contains lumi number X". Why do we push that list into an Oracle table? Masochism?
I had a quick look at the data format; it is WAY TOO LOOSE and the current
representation can be cut in half very easily. Here are a few suggestions:
- the dict key names are too long: compare the length of `lumi_section_num` against the value 123 it carries.
If you replace the long string names with short representations you
can cut a significant amount of data, e.g.
use `ls` instead of `lumi_section_num`, use `fl` instead of `file_lumi_list`, etc.
- replace nested structures with a flat format, e.g. instead of
```{'file_lumi_list':[{u'lumi_section_num': 27414, u'run_num': 1}, {u'lumi_section_num': 26422, u'run_num': 2}, ...]}```
which takes TOO MUCH memory when Python tries to allocate every dict in the list,
use a simpler structure like:
```{'fl':[(27414,1),(26422,2),...]}```
The advantage of the latter is that Python will allocate only ONE dict instead of many.
This will not only reduce the size of the input, but also reduce the memory
allocated on the DBS server, i.e. a win-win situation.
- you can further optimize the data format by using flat lists instead of lists of dicts,
e.g. instead of using
```
'file_conf_list': [{u'release_version': 'CMSSW_1_2_3', u'pset_hash':
'76e303993a1c2f842159dbfeeed9a0dd', u'lfn': '/store/data/a/b/A/a/1/abcd0.root',
u'app_name': 'cmsRun', u'output_module_label': 'Merged', u'global_tag': 'my-cms-gtag::ALL'},
{u'release_version': 'CMSSW_1_2_3', u'pset_hash':
'76e303993a1c2f842159dbfeeed9a0dd', u'lfn': '/store/data/a/b/A/a/1/abcd1.root',
u'app_name': 'cmsRun', u'output_module_label': 'Merged' , u'global_tag': 'my-cms-gtag::ALL'},
...
]
```
you can use a flat structure, e.g.
```
"fl":[
['CMSSW_1_2_3','76e303993a1c2f842159dbfeeed9a0dd','/store/data/a/b/A/a/1/abcd0.root',...],
['CMSSW_1_2_3','76e303993a1c2f842159dbfeeed9a0dd','/store/data/a/b/A/a/1/abcd1.root',...]
]
```
this will further reduce the data size.
I bet that just doing this optimization you can reduce 200MB to O(10)MB; see the rough check sketched below.
I understand that it will require changes to both the DBS server and clients, but
using JSON without thinking about the consequences is not optimal. We can no longer
afford to waste resources, and proper optimization should be in place.
If you agree, we can outline a proper format for (at least this) DBS API
and start a campaign of enforcing the new data format.
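As a back-of-the-envelope check of that claim (an illustrative sketch, not DBS code), one can serialize the same run/lumi pairs in both the nested and the flat form and compare the sizes:
```python
# Compare the serialized size of the nested dict format vs. the flat format
# for 100k run/lumi pairs (random values, for illustration only).
import json
import random

pairs = [(random.randint(1, 300), random.randint(1, 5000)) for _ in range(100000)]

nested = {'file_lumi_list': [{'run_num': r, 'lumi_section_num': l} for r, l in pairs]}
flat = {'fl': [[r, l] for r, l in pairs]}

print(len(json.dumps(nested)))  # long key names repeated for every element
print(len(json.dumps(flat)))    # typically several times smaller
```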
…On 0, Stefano Belforte wrote:
CRAB uses the format from this example to fill the structure to be passed to `insertBulkBlock`
https://github.com/dmwm/DBS/blob/master/Client/tests/dbsclient_t/unittests/blockdump.dict
since we could not find any other documentation.
In that every file is a list of dictionaries one of which is a list of {'run':int; 'lumi':int}
We have not touched that code since it was written early in DBS3 history.
I would distinguish three things here:
1. how to pass that list efficiently (ranges may not get more than a few O(1) factors since there are many gaps lumis are scattered almost at random in initial RAW files)
2. how to store that information (i.e. which kind of query and/or retrieval do we want)
3. when to store it (i.e. for which files/datasets)
@vkuznet clearly a leaner protocol will help. Maybe such a change can be kept inside the current DBS client API to avoid changes to WMA/CRAB? E.g. flat lists surely are efficient but are error-prone when handed to naive code writers (like me), while a well-coded and validated method can take the verbose thing and zip it at best. Why not start with `insertBulkBlock` returning an error when it thinks the input is too large? Then it can surely relax the limits once it is able to reduce the input to a more compact structure and evaluate that. OTOH I hope we can also make some progress on what exactly we need from DBS. Even if we could store 10M lumis for one block, do we really want to do it?
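A hypothetical sketch of such a guard (not the real DBS handler; it assumes a blockdump-like layout where each entry under 'files' carries a 'file_lumi_list', and the threshold is an arbitrary example):
```python
import cherrypy

MAX_LUMIS_PER_BLOCK = 100000  # example value only

def check_bulkblock_payload(block):
    """Refuse bulk-block payloads that carry too many lumi sections."""
    nlumis = sum(len(f.get('file_lumi_list', [])) for f in block.get('files', []))
    if nlumis > MAX_LUMIS_PER_BLOCK:
        raise cherrypy.HTTPError(413, "block has %d lumi sections, limit is %d"
                                 % (nlumis, MAX_LUMIS_PER_BLOCK))
```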
Stefano, flat budget, reduced manpower and increased load due to more data
leave us no choice but to be efficient. We should not think about the convenience
of "reading" our data in a human format, but rather concentrate on the efficiency
of our system. I don't mind keeping the format changes isolated to the DBS server and
client, but I think it is Yuyi's call.
And the decision on what should be stored in DBS is a parallel and independent
issue from data-flow optimization.
…On 0, Stefano Belforte wrote:
@vkuznet clearly a leaner protocol will help. Maybe such a change can be kept inside current DBS client API to avoid changes to WMA/CRAB ?
E.g. flat lists surely are efficient but are error-prone to be given for use to naive code writers (like me), while a well coded and validated method can take the verbose thing and zip it at best.
Why not start with `insertBulkBlock` returning an error when it thinks input is too large ?
Then it can surely relax the limits once is able to reduce to more compact structure
and evaluate that.
OTOH I hope we can also make some progress on what exactly we need from DBS. Even if we could store 10M lumis for one block, do we really want to do it ?
I have a release on Monday. I will go over the discussion later.
Is it possible the issue is not the size of the lumi information but rather how we are loading it? That is, any web server worth its salt should be able to easily handle a 200MB POST -- however, it's going to be extremely difficult to manage such a thing if all 200MB have to be buffered into memory at once! @vkuznet - does the frontend need to load the full POST before it can start proxying the request to the remote side? A few thoughts:
1. How many APIs (or API implementations) need "fixed"?
2. Do we need to switch to a streaming JSON decoder/encoder? Is there a reason to render the whole structure in memory inside DBS?
3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?
Looks like a little medicine in the implementation might be able to go a long way, especially with respect to a streaming JSON decoder.
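One possible shape of the streaming-decoder idea, using the third-party ijson package (an assumption for illustration, not something DBS does today; the document path below is hypothetical):
```python
import ijson

# Iterate over file_lumi_list entries one at a time instead of loading
# the whole 200 MB blockdump document into memory.
with open('blockdump.json', 'rb') as fobj:
    for lumi in ijson.items(fobj, 'files.item.file_lumi_list.item'):
        # lumi is a small dict like {'run_num': ..., 'lumi_section_num': ...};
        # bind/insert it here without keeping the full structure around
        print(lumi)
```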
Brian, see my comments inline.
> Is it possible the issue is not the size of the lumi information but rather how we are loading it?

I didn't look explicitly into the DBS/CRAB APIs, but it seems to me that the answer is
yes, based on the current document structure: we send a whole JSON document containing
nested data structures, which causes the DBS memory blow-up.
> That is, any web server worth its salt should be able to easily handle a 200MB POST -- *however*, it's going to be extremely difficult to manage such a thing if all 200MB have to be buffered to memory at once! @vkuznet - does the frontend need to load the full POST before it can start proxying the request to the remote side?

I doubt that the frontend needs to load the full POST request; we can use chunked POST requests if necessary.
> A few thoughts:
> 1. How many APIs (or API implementations) need "fixed"?
> 2. Do we need to switch to a streaming JSON decoder/encoder? Is there a reason to render the whole structure in memory inside DBS?

Yes, we need JSON streaming and I think we have an implementation for that in
WMCore, but to implement it on the DBS side we need to change the data format(s).
> 3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?

It is a good suggestion, but we need to be careful here since users want to
look this information up later, so decompression will happen in a different part
(e.g. DAS).
> Looks like a little medicine in the implementation might be able to go a long way, especially with respect to a streaming JSON decoder.

Yes.
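On the chunked-POST point, an illustrative sketch (the URL and payload are made up): with the requests library, passing a generator as the body makes the client send the document with Transfer-Encoding: chunked, so neither side has to hold the full serialized JSON at once.
```python
import json
import requests

def json_chunks(block):
    """Yield the JSON text for a block piece by piece."""
    yield '{"files":['
    for i, fdict in enumerate(block['files']):
        if i:
            yield ','
        yield json.dumps(fdict)
    yield ']}'

block = {'files': [{'logical_file_name': '/store/data/a/b/abcd0.root'}]}
resp = requests.post('https://dbs.example.invalid/insertBulkBlock',
                     data=(chunk.encode('utf-8') for chunk in json_chunks(block)))
```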
It already creates annoying memory problems in the CRAB code when
we build this JSON, which is why we still have to 'publish in DBS'
on an ad-hoc machine rather than as part of job post-processing in the schedd
(and one of the reasons why CRAB does not try to put more than 100 files in one block).
…On 18/04/2019 11:22, Valentin Kuznetsov wrote:
I didn't look explicitly into DBS/CRAB APIs, but it seems to me that answer is
yes based on current document structure, i.e. we send whole JSON which contains
nested data-structures which cause DBS memory blow-out.
See my previous question: why do we store this? To serve the list to users upon request,
or to allow Oracle to find 'all files for lumi X, run Y in the whole CMS sample'?
We do not care about being able to accommodate any silly user request, but we must be
sure physics can still be done. It is not only the sending of this; I worry (maybe without
reason) about the biggest table in DBS getting bigger and bigger.
…On 18/04/2019 11:22, Valentin Kuznetsov wrote:
… > 3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?
it is a good suggestion, b
Let me point you to a historical discussion with Lassi which I had 8 years ago:
https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/753.html
https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/752.html
There he proposed, and we discussed, a compact JSON format (at the time it
was relevant for PhEDEx and DAS). In particular, I looked up my emails and found
this (quoting):
**
The tests I did in the past showed that 200 MB of PhEDEx data (the current JSON
data structure of all blocks) requires > 1 GB of RAM for JSON parsing, while
parsing the same data using XML can be done at a cost of 20MB of RAM (+). The
PhEDEx JSON representation is basically a list holding dicts (RAM grows due to the
allocation of dicts in an open list). ...
**
At that time the PhEDEx JSON format was similar to what DBS uses now, i.e.
JSON which holds nested data structures. My measurements showed that the identical
XML representation had 10 times lower memory consumption; that's why Lassi
proposed a "flat" JSON format suitable for streaming and a low memory footprint.
The jsonstreamer decorator I used in DAS:
https://github.com/dmwm/DAS/blob/master/src/python/DAS/web/das_web_srv.py#L714
https://github.com/dmwm/DAS/blob/master/src/python/DAS/web/tools.py#L156
and it is available in WMCore:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/Web/tools.py#L160
This code is based on the studies I did with Lassi, and can be adapted to the DBS APIs.
…On 0, Brian P Bockelman wrote:
Is it possible the issue is not the size of the lumi information but rather how we are loading it?
That is, any web server worth its salt should be able to easily handle a 200MB POST -- *however*, it's going to be extremely difficult to manage such a thing if all 200MB have to be buffered to memory at once! @vkuznet - does the frontend need to load the full POST before it can start proxying the request to the remote side?
A few thoughts:
1. How many APIs (or API implementations) need "fixed"?
2. Do we need to switch to a streaming JSON decoder/encoder? Is there a reason to render the whole structure in memory inside DBS?
3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?
Looks like a little medicine in the implementation might be able to go a long way, especially with respect to a streaming JSON decoder.
Here is a fully working example of a jsonstreamer (save it as jsonstreamer.py). If you run it, you will see the document streamed out chunk by chunk. Now we only need to write the server side, which will read the chunks and then compose the JSON object.
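A minimal sketch of the same idea (illustrative only, not the code from the gist referenced in the next comments):
```python
# jsonstreamer.py -- emit a dict of flat lists as JSON text, chunk by chunk,
# so the full serialized document never has to sit in memory at once.
import json

def jsonstreamer(data):
    """Yield pieces of a JSON document for a dict whose values are lists."""
    yield '{'
    for i, (key, values) in enumerate(data.items()):
        if i:
            yield ','
        yield '"%s":[' % key
        for j, item in enumerate(values):
            if j:
                yield ','
            yield json.dumps(item)
        yield ']'
    yield '}'

if __name__ == '__main__':
    for chunk in jsonstreamer({'fl': [[27414, 1], [26422, 2]]}):
        print(chunk)
```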
You can check it out with a more sophisticated nested Python structure too, but I will not paste the output of that here since it is kind of big.
And now I have completed the full example; you can see it here: https://gist.github.com/vkuznet/e90b5a7cc92005df7d33877abde3206f It provides the flat data format and the streaming discussed above. If you run the code, the output shows that even in this basic example the original nested dict is significantly larger than its flat representation.
The corresponding PR which provides support for different input formats can be found here: #618