Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert unicode to bytestring within the dumpBlock server API #606

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

amaltaro
Copy link
Contributor

Fixes #605
Possible workaround for dealing with Unicode and byte strings inside the dumpBlock API.

@yuyiguo I create a simple script querying a couple of DBS APIs, either with an unicode string as input, or as a byte string as input, and I don't see any issues.
Did you manage to get the exact line where the SQL query hangs? I assume it's this one:
https://github.com/dmwm/DBS/compare/master...amaltaro:unicode-blockDump?expand=1#diff-3c13637f700d283f5305fd63f2a8bc4aR130
?

PS.: This module has a mix of spaces and tabs, so I had to put many more changes in than what I wanted.

@amaltaro
Copy link
Contributor Author

Actually convertByteStr function will still return a unicode object type, but represented in 8-bits instead of 16 or 32. I'm not sure whether it is enough for cx_Oracle, or whether we really have to convert the object type from unicode to byte string, e.g.:
return str(unicodeStr)

@amaltaro
Copy link
Contributor Author

test this please

Utilitarian function which converts an unicode string to
an 8-bit string.
"""
if isinstance(unicodeStr, basestring):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basestring is not supported in python3, instead better use str

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but at least we know this is a case that will hit us again when we move to python3. So it's easier to identify when we start the real migration.

an 8-bit string.
"""
if isinstance(unicodeStr, basestring):
return unicodeStr.encode("utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to convert to utf-8 and not to ascii? Do we store dataset/block/lfn as utf8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a question to Yuyi. utf-8 supports ascii, so we should be fine as well (not sure about database performance though).

Copy link
Contributor

@vkuznet vkuznet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that proper place to fix unicode->string conversion is at DAO level of DBS code and not here. For instance this code calls self.blocklist.execute function and pass dataset name, block name, etc. Why should we fix every occurrence of self.blocklist.execute instead of fixing the underlying function which perform SQL action. The code I'm talking about is here:
https://github.com/dmwm/DBS/tree/027b7ae0839e788c244277833c86b17a4aaecb91/Server/Python/src/dbs/dao/Oracle
and every subdirectory contains DAO object, e.g. Block/List.py execute function is defined here:

def execute(self, conn, dataset="", block_name="", data_tier_name="", origin_site_name="", logical_file_name="",

My suggestion is to move convertByteStr into utils area of DBS code to make it re-usable by other modules and then fix underlying DAO object to convert given dataset/block/file names. Then the rest of the DBS code will be fixed automatically.

@amaltaro amaltaro mentioned this pull request May 17, 2019
Copy link
Contributor Author

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, Valentin. I tried to keep a well performing code here, that's why an ad-hoc solution.

Utilitarian function which converts an unicode string to
an 8-bit string.
"""
if isinstance(unicodeStr, basestring):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but at least we know this is a case that will hit us again when we move to python3. So it's easier to identify when we start the real migration.

an 8-bit string.
"""
if isinstance(unicodeStr, basestring):
return unicodeStr.encode("utf-8")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a question to Yuyi. utf-8 supports ascii, so we should be fine as well (not sure about database performance though).

@vkuznet
Copy link
Contributor

vkuznet commented May 17, 2019 via email

@yuyiguo
Copy link
Member

yuyiguo commented May 17, 2019

unicodeStr.encode("ascii") is enough.
I would prefer to do this in the global level instead of doing on DBS. I don't see cmsweb applications need any of the features of unicode. Tuning on individual application on this coding will make future development and maintains difficult.

@amaltaro amaltaro force-pushed the unicode-blockDump branch from 027b7ae to 5c4be59 Compare May 17, 2019 13:44
@amaltaro
Copy link
Contributor Author

Updated the code to ascii.
A global fix could be made at this level (when we parse all the binds):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123

it of course will add a global overhead to any service relying on the WMCore/Database package.
However, at least this issue wouldn't hit us again. In addition to that, we could then - probably -remove the sanitization Valentin put in at the WebTools level.

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 2 warnings
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-DBS-PR-test/112/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor

vkuznet commented May 17, 2019 via email

@yuyiguo
Copy link
Member

yuyiguo commented May 17, 2019

Alan, Valentin,

I agreed that fixing should be at https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Database/DBCore.py#L123
too.

Alan, Do you want to make the PR since this is in WMCore land? Otherwise, I will do it.
Valentin, How do you want to branchmark it? I can blockdump a bunch of blocks to test it.
THanks,
Yuyi

@amaltaro
Copy link
Contributor Author

Alan, Do you want to make the PR since this is in WMCore land?

Yes, I can work on it, but probably only on Monday. So if you feel like coming with a PR before that, please be my guest.

@yuyiguo
Copy link
Member

yuyiguo commented May 17, 2019

Thanks Alan. Monday is fine with me so I leave this to you. I am working on throttling code that needs to going to June release. I hope it won't take to long to get it work properly. I am running off my time for CMS already.

@vkuznet
Copy link
Contributor

vkuznet commented May 17, 2019 via email

@yuyiguo
Copy link
Member

yuyiguo commented May 17, 2019

Valentin,
The change will be made in DBCore.py#123. We do not need to do
binds[key] = convertBytesStr(binds[key]), right?
If we want to branchmark the unicode to byte string conversion, I just need to call listFileLumis API with a long list of lfn and counting the time. I could just comment out the DB part and return the time.
Yuyi

@vkuznet
Copy link
Contributor

vkuznet commented May 17, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MIgration failures
4 participants