Convert unicode to bytestring within the dumpBlock server API #606
base: master
Conversation
Actually
test this please
```
    Utilitarian function which converts an unicode string to
    an 8-bit string.
    """
    if isinstance(unicodeStr, basestring):
```
`basestring` is not supported in Python 3; better to use `str` instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but at least we know this is a case that will hit us again when we move to Python 3, so it's easier to identify when we start the real migration.
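For reference, a minimal sketch of how the type check could be made to work under both Python 2 and Python 3 (the helper name `convertByteStr` follows this PR; the `try/except NameError` pattern is an illustration, not code from the PR):

```python
# Resolve the text base type for the running interpreter:
# Python 2 has basestring; Python 3 only has str.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str         # Python 3

def convertByteStr(value):
    """Encode a text string to an 8-bit (byte) string; pass anything else through."""
    if isinstance(value, string_types):
        return value.encode("utf-8")
    return value
```

This keeps the behaviour identical on Python 2 while removing the hard `basestring` dependency flagged above.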
```
    an 8-bit string.
    """
    if isinstance(unicodeStr, basestring):
        return unicodeStr.encode("utf-8")
```
Why do we need to convert to `utf-8` and not to `ascii`? Do we store dataset/block/LFN names as utf-8?
I guess it's a question for Yuyi. utf-8 is a superset of ascii, so we should be fine either way (not sure about database performance, though).
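To illustrate the point being discussed (this is not code from the PR, and the sample name is made up): for ASCII-only input, which dataset/block/LFN names are in practice, the two encodings produce identical bytes; they only differ on non-ASCII input, where strict ascii raises an error.

```python
# ASCII-only name: utf-8 and ascii encodings give the same bytes
name = u"/store/data/Run2012B/block#abc-123"
assert name.encode("utf-8") == name.encode("ascii")

# Non-ASCII input: utf-8 still works, strict ascii raises
try:
    u"bl\u00f6ck".encode("ascii")
except UnicodeEncodeError:
    print("ascii cannot encode non-ASCII characters")
```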
It seems to me that the proper place to fix the unicode-to-string conversion is at the DAO level of the DBS code, not here. For instance, this code calls the `self.blocklist.execute` function and passes the dataset name, block name, etc. Why should we fix every occurrence of `self.blocklist.execute` instead of fixing the underlying function which performs the SQL action? The code I'm talking about is here:
https://github.com/dmwm/DBS/tree/027b7ae0839e788c244277833c86b17a4aaecb91/Server/Python/src/dbs/dao/Oracle
and every subdirectory contains a DAO object; e.g. the Block/List.py `execute` function is defined here:
```
def execute(self, conn, dataset="", block_name="", data_tier_name="", origin_site_name="", logical_file_name="",
```
My suggestion is to move `convertByteStr` into the utils area of the DBS code to make it re-usable by other modules, and then fix the underlying DAO objects to convert the given dataset/block/file names. Then the rest of the DBS code will be fixed automatically.
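A hedged sketch of this suggestion (class and argument names only loosely mirror the actual DBS Oracle DAO layout, and it is written for Python 3 where `str` is the text type):

```python
# Shared utility, imaginable as living in a DBS utils module
def convertByteStr(value):
    """Return value encoded as an 8-bit string if it is a text string."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value

class BlockList(object):
    """Stand-in for the Oracle Block/List.py DAO object."""
    def execute(self, conn, dataset="", block_name="", logical_file_name=""):
        # Convert once, at the lowest level, so every caller is fixed
        dataset = convertByteStr(dataset)
        block_name = convertByteStr(block_name)
        logical_file_name = convertByteStr(logical_file_name)
        # ... would build and run the SQL query here; we just return the binds
        return dataset, block_name, logical_file_name
```

With the conversion inside `execute`, business-layer code such as `dumpBlock` would not need any per-call fixes.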
Thanks for the review, Valentin. I tried to keep the code well performing here; that's why the ad-hoc solution.
I just want to provide a pointer to **exactly** the same problem I fixed in the RESTModel code 3 years ago (I even put my comment starting with VK about it):
dmwm/WMCore@b97d8ee#diff-27caaae36aa502e2719c31060d1bfa8fR237
So, on the server side the input arguments should be converted from unicode to string.
Fixes #605
Possible workaround for dealing with Unicode and byte strings inside the `dumpBlock` API.
@yuyiguo I created a simple script querying a couple of DBS APIs, either with a unicode string or with a byte string as input, and I don't see any issues.
Did you manage to get the exact line where the SQL query hangs? I assume it's this one:
https://github.com/dmwm/DBS/compare/master...amaltaro:unicode-blockDump?expand=1#diff-3c13637f700d283f5305fd63f2a8bc4aR130
PS: This module has a mix of spaces and tabs, so I had to put in many more changes than I wanted.
`unicodeStr.encode("ascii")` is enough.
use ascii encoding instead
(force-pushed from 027b7ae to 5c4be59)
Updated the code to `ascii`.
A global fix could be made at this level (when we parse all the binds):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123
It would of course add a global overhead to any service relying on the WMCore/Database package. However, at least this issue wouldn't hit us again. In addition, we could then probably remove the sanitization Valentin put in at the WebTools level.
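A rough sketch of what such a global fix could look like (the function name `sanitizeBinds` and the assumption that binds arrive as a dict or list of dicts are illustrative; the real DBCore parsing may differ, and this is written for Python 3 where `str` is the text type):

```python
def sanitizeBinds(binds):
    """Convert unicode values in bind dicts to byte strings, in place."""
    if isinstance(binds, dict):
        binds = [binds]
    for bind in binds:
        for key, value in bind.items():
            if isinstance(value, str):
                # replacing values (not adding/removing keys) during
                # iteration is safe for dicts
                bind[key] = value.encode("ascii")
    return binds

# Example binds as they might reach the parsing step
binds = [{"dataset": u"/a/b/c", "run": 1}]
sanitizeBinds(binds)
```

Since the conversion happens in place, every service going through WMCore/Database would pick it up, at the cost of touching every bind on every query.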
Jenkins results:
Alan,
I agree this is the proper place, even lower than what I suggested at the DAO level. And yes, it will introduce the overhead, but indeed it will fix this problem for everything.
Can someone benchmark the overhead? Maybe we can write a C extension for that to speed things up.
V.
Alan, Valentin, I agree that the fix should be at https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123 too.
Alan, do you want to make the PR since this is in WMCore land? Otherwise, I will do it.
Valentin, how do you want to benchmark it? I can blockdump a bunch of blocks to test it.
Yes, I can work on it, but probably only on Monday. So if you feel like coming up with a PR before that, please be my guest.
Thanks, Alan. Monday is fine with me, so I'll leave this to you. I am working on the throttling code that needs to go into the June release. I hope it won't take too long to get it to work properly. I am running out of my time for CMS already.
Yuyi,
if we're going to do this at the DBCore.py#L123 level, it means that we need to benchmark the binds dict.
Then I would suggest taking one LFN name, it does not matter which one, and creating a binds dict with a bunch of identical LFN names as unicode. The size you test should correspond to the largest block, i.e. find out how many LFNs we have in the largest block and create a dict/list of LFNs for the bind.
Then take Alan's code and iterate over the binds dict with it, converting all the LFNs in it.
Please note that if we work with bind dicts we need to make the conversion in place, e.g.
```
# pseudo-code flow, adjust as necessary using proper bind dict structure
time0 = time.time()
for key in binds.keys():
binds[key] = convertBytesStr(binds[key])
print("elapsed time:", time.time()-time0)
```
But we should be careful, since the binds dict will be changed, and if upstream code relies on it we should do something different. I want to avoid a deep copy of the dict to perform the conversion.
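The recipe above can be turned into a small runnable benchmark. This is only a sketch: the LFN is a made-up name, 200000 is an arbitrary stand-in for the largest block's file count, and `convertByteStr` mirrors the helper from this PR in Python 3 form (`str` as the text type):

```python
import time

def convertByteStr(value):
    """Encode a text string to an 8-bit string; pass anything else through."""
    if isinstance(value, str):
        return value.encode("ascii")
    return value

# One LFN repeated N times, as unicode, mimicking a large block's binds
lfn = u"/store/data/Run2012B/DoubleMu/AOD/v1/0000/file.root"
binds = {i: lfn for i in range(200000)}

time0 = time.time()
for key in binds:  # the key set is untouched, so in-place value
    binds[key] = convertByteStr(binds[key])  # updates are safe here
print("elapsed time: %.3f sec" % (time.time() - time0))
```

Because only values are replaced and no keys are added or removed, the in-place loop avoids the deep copy mentioned above.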
Valentin,
the change will be made in DBCore.py#L123. We do not need to do `binds[key] = convertBytesStr(binds[key])`, right?
If we want to benchmark the unicode to byte string conversion, I just need to call the listFileLumis API with a long list of LFNs and count the time. I could just comment out the DB part and return the time.
Yuyi
Yuyi,
I'm not sure I understood your comment. I think we will do the unicode to byte string conversion via the `convertBytesStr` function. If so, you'll need to benchmark a loop over a set of unicode strings (LFNs), converting them to string objects. You may use whatever DBS API is appropriate for that, only with a reasonable (better, the largest) amount of LFNs we need to deal with.