Convert unicode to bytestring within the dumpBlock server API #606
base: master
Conversation
Actually
test this please
```
    Utilitarian function which converts an unicode string to
    an 8-bit string.
    """
    if isinstance(unicodeStr, basestring):
```
`basestring` is not supported in Python 3; better to use `str` instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but at least we know this is a case that will hit us again when we move to Python 3, so it's easier to identify when we start the real migration.
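For reference, a minimal sketch of how the type check could be made to work under both Python 2 and Python 3 (the helper name `convertByteStr` follows this PR; the `try/except NameError` pattern is an illustration, not code from the PR):

```python
# Resolve the text base type for the running interpreter:
# Python 2 has basestring; Python 3 only has str.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str         # Python 3

def convertByteStr(value):
    """Encode a text string to an 8-bit (byte) string; pass anything else through."""
    if isinstance(value, string_types):
        return value.encode("utf-8")
    return value
```

This keeps the behaviour identical on Python 2 while removing the hard `basestring` dependency flagged above.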
```
    an 8-bit string.
    """
    if isinstance(unicodeStr, basestring):
        return unicodeStr.encode("utf-8")
```
Why do we need to convert to `utf-8` and not to `ascii`? Do we store dataset/block/LFN names as utf-8?
I guess it's a question for Yuyi. utf-8 is a superset of ascii, so we should be fine either way (not sure about database performance, though).
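To illustrate the point being discussed (this is not code from the PR, and the sample name is made up): for ASCII-only input, which dataset/block/LFN names are in practice, the two encodings produce identical bytes; they only differ on non-ASCII input, where strict ascii raises an error.

```python
# ASCII-only name: utf-8 and ascii encodings give the same bytes
name = u"/store/data/Run2012B/block#abc-123"
assert name.encode("utf-8") == name.encode("ascii")

# Non-ASCII input: utf-8 still works, strict ascii raises
try:
    u"bl\u00f6ck".encode("ascii")
except UnicodeEncodeError:
    print("ascii cannot encode non-ASCII characters")
```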
It seems to me that the proper place to fix the unicode-to-string conversion is at the DAO level of the DBS code, not here. For instance, this code calls the `self.blocklist.execute` function and passes the dataset name, block name, etc. Why should we fix every occurrence of `self.blocklist.execute` instead of fixing the underlying function which performs the SQL action? The code I'm talking about is here:
https://github.com/dmwm/DBS/tree/027b7ae0839e788c244277833c86b17a4aaecb91/Server/Python/src/dbs/dao/Oracle
and every subdirectory contains a DAO object; e.g. the Block/List.py `execute` function is defined here:
```
def execute(self, conn, dataset="", block_name="", data_tier_name="", origin_site_name="", logical_file_name="",
```
My suggestion is to move `convertByteStr` into the utils area of the DBS code to make it re-usable by other modules, and then fix the underlying DAO objects to convert the given dataset/block/file names. Then the rest of the DBS code will be fixed automatically.
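A hedged sketch of this suggestion (class and argument names only loosely mirror the actual DBS Oracle DAO layout, and it is written for Python 3 where `str` is the text type):

```python
# Shared utility, imaginable as living in a DBS utils module
def convertByteStr(value):
    """Return value encoded as an 8-bit string if it is a text string."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value

class BlockList(object):
    """Stand-in for the Oracle Block/List.py DAO object."""
    def execute(self, conn, dataset="", block_name="", logical_file_name=""):
        # Convert once, at the lowest level, so every caller is fixed
        dataset = convertByteStr(dataset)
        block_name = convertByteStr(block_name)
        logical_file_name = convertByteStr(logical_file_name)
        # ... would build and run the SQL query here; we just return the binds
        return dataset, block_name, logical_file_name
```

With the conversion inside `execute`, business-layer code such as `dumpBlock` would not need any per-call fixes.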
Thanks for the review, Valentin. I tried to keep the code well performing here; that's why the ad-hoc solution.
I just want to provide a pointer to **exactly** the same problem I fixed in the RESTModel code 3 years ago (I even put my comment starting with VK about it):
dmwm/WMCore@b97d8ee#diff-27caaae36aa502e2719c31060d1bfa8fR237
So, on the server side the input arguments should be converted from unicode to string.
Fixes #605
Possible workaround for dealing with Unicode and byte strings inside the `dumpBlock` API.
@yuyiguo I created a simple script querying a couple of DBS APIs, either with a unicode string or with a byte string as input, and I don't see any issues.
Did you manage to get the exact line where the SQL query hangs? I assume it's this one:
https://github.com/dmwm/DBS/compare/master...amaltaro:unicode-blockDump?expand=1#diff-3c13637f700d283f5305fd63f2a8bc4aR130
PS: This module has a mix of spaces and tabs, so I had to put in many more changes than I wanted.
`unicodeStr.encode("ascii")` is enough.
use ascii encoding instead
(force-pushed from 027b7ae to 5c4be59)
Updated the code to `ascii`.
A global fix could be made at this level (when we parse all the binds):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123
It would of course add a global overhead to any service relying on the WMCore/Database package. However, at least this issue wouldn't hit us again. In addition, we could then probably remove the sanitization Valentin put in at the WebTools level.
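A rough sketch of what such a global fix could look like (the function name `sanitizeBinds` and the assumption that binds arrive as a dict or list of dicts are illustrative; the real DBCore parsing may differ, and this is written for Python 3 where `str` is the text type):

```python
def sanitizeBinds(binds):
    """Convert unicode values in bind dicts to byte strings, in place."""
    if isinstance(binds, dict):
        binds = [binds]
    for bind in binds:
        for key, value in bind.items():
            if isinstance(value, str):
                # replacing values (not adding/removing keys) during
                # iteration is safe for dicts
                bind[key] = value.encode("ascii")
    return binds

# Example binds as they might reach the parsing step
binds = [{"dataset": u"/a/b/c", "run": 1}]
sanitizeBinds(binds)
```

Since the conversion happens in place, every service going through WMCore/Database would pick it up, at the cost of touching every bind on every query.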
Jenkins results:
Alan,
I agree this is the proper place, even lower than what I suggested at the DAO level. And yes, it will introduce the overhead, but indeed it will fix this problem for everything.
Can someone benchmark the overhead? Maybe we can write a C extension for that to speed things up.
V.
Alan, Valentin, I agree that the fix should be at https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123 too.
Alan, do you want to make the PR since this is in WMCore land? Otherwise, I will do it.
Valentin, how do you want to benchmark it? I can blockdump a bunch of blocks to test it.
Yes, I can work on it, but probably only on Monday. So if you feel like coming up with a PR before that, please be my guest.
Thanks, Alan. Monday is fine with me, so I'll leave this to you. I am working on the throttling code that needs to go into the June release. I hope it won't take too long to get it to work properly. I am running out of my time for CMS already.
Yuyi,
if we're going to do this at the DBCore.py#L123 level, it means that we need to benchmark the binds dict.
Then I would suggest taking one LFN name, it does not matter which one, and creating a binds dict with a bunch of identical LFN names as unicode. The size you test should correspond to the largest block, i.e. find out how many LFNs we have in the largest block and create a dict/list of LFNs for the bind.
Then take Alan's code and iterate over the binds dict with it, converting all the LFNs in it.
Please note that if we work with bind dicts we need to make the conversion in place, e.g.
```
# pseudo-code flow, adjust as necessary using proper bind dict structure
time0 = time.time()
for key in binds.keys():
binds[key] = convertBytesStr(binds[key])
print("elapsed time:", time.time()-time0)
```
But we should be careful, since the binds dict will be changed, and if upstream code relies on it we should do something different. I want to avoid a deep copy of the dict to perform the conversion.
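The recipe above can be turned into a small runnable benchmark. This is only a sketch: the LFN is a made-up name, 200000 is an arbitrary stand-in for the largest block's file count, and `convertByteStr` mirrors the helper from this PR in Python 3 form (`str` as the text type):

```python
import time

def convertByteStr(value):
    """Encode a text string to an 8-bit string; pass anything else through."""
    if isinstance(value, str):
        return value.encode("ascii")
    return value

# One LFN repeated N times, as unicode, mimicking a large block's binds
lfn = u"/store/data/Run2012B/DoubleMu/AOD/v1/0000/file.root"
binds = {i: lfn for i in range(200000)}

time0 = time.time()
for key in binds:  # the key set is untouched, so in-place value
    binds[key] = convertByteStr(binds[key])  # updates are safe here
print("elapsed time: %.3f sec" % (time.time() - time0))
```

Because only values are replaced and no keys are added or removed, the in-place loop avoids the deep copy mentioned above.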
Valentin,
the change will be made in DBCore.py#L123. We do not need to do `binds[key] = convertBytesStr(binds[key])`, right?
If we want to benchmark the unicode to byte string conversion, I just need to call the listFileLumis API with a long list of LFNs and count the time. I could just comment out the DB part and return the time.
Yuyi
Yuyi,
I'm not sure I understood your comment. I think we will do the unicode to byte string conversion via the `convertBytesStr` function. If so, you'll need to benchmark a loop over a set of unicode strings (LFNs), converting them to string objects. You may use whatever DBS API is appropriate for that, only with a reasonable (better, the largest) amount of LFNs we need to deal with.