-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some real EBCDIC testing #656
base: master
Are you sure you want to change the base?
Conversation
fb1c448
to
4b9f908
Compare
I've been undecided about what's the best test suite to run on the EBCDIC build. Here's my decision: I want to run the full test suite (1 to 27), but obviously there will be some failures. I'll introduce a new "#ifndef EBCDIC" directive in the test input/output files, to mask out the specific tests that fail in EBCDIC mode. These will be ones of the form I basically don't want to add a new "testoutputN-EBC" for every testoutput file (too verbose when writing tests). Nor do I want to just skip all the existing tests. Nor can we somehow maintain a list of "expected failures" under EBCDIC. I'll also rename the existing "testinputEBC" to "testinput28" and just make it conditional the EBCDIC build, in the same way that lots of existing tests are conditional on build properties. |
Is it possible to really test EBCDIC on ASCII? I hope this will not make our life difficult. |
Yes, it's totally possible. It's in the PR currently, I just haven't updated the RunTest script to get all the tests passing. It's also necessary to test the EBCDIC code - I just don't feel comfortable shipping lines of code that we haven't compiled or tested. |
06219fc
to
fdeb41f
Compare
f9ca812
to
aea53cd
Compare
71351f4
to
ec9fdc8
Compare
2bc730f
to
8613839
Compare
b1d55a2
to
775385d
Compare
@zherczeg I'm proud that I managed to make some changes to the JIT code in this PR! I reckon the EBCDIC code has never been tested well. I wonder if anyone uses it at all :( I have run the full test suite against it in this PR, with and without JIT, and uncovered numerous bugs. This PR is huge - please be selective in what you review, and you don't need to look at every file! But I would appreciate a look at the JIT changes in particular, and maybe the small bugfixes I've done on the core There is lots of spam in the testdata, and files like |
7d4555a
to
da2a2f8
Compare
src/pcre2_compile_class.c
Outdated
@@ -1280,8 +1280,23 @@ while (TRUE) | |||
value of 1 removes vertical space and 2 removes underscore. */ | |||
|
|||
if (tabopt < 0) tabopt = -tabopt; | |||
#ifdef EBCDIC | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong indentation for braces.
@@ -188,7 +216,7 @@ while (plength > 0) | |||
switch (posix_state) | |||
{ | |||
case POSIX_CLASS_STARTED: | |||
if (c <= 127 && islower(c)) break; /* Remain in started state */ | |||
if (ISLOWER(c)) break; /* Remain in started state */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually the comparison with 127 is present, because it is compatible with utf8. Changing it to 256 might have side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ISLOWER macro is explicit above. The check for <127 is only needed if calling the C function islower()
. In EBCDIC, the Latin alphabet occupies values >127.
@@ -1966,9 +1966,9 @@ switch(type) | |||
detect_partial_match(common, backtracks); | |||
|
|||
if (type == OP_NOT_HSPACE) | |||
read_char(common, 0x9, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR); | |||
read_char(common, 0x1, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be changed to 0x0, since one byte utf8 characters may match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure why there was a min/max at all. I change it to min=0x1, max=0x3000
to match some other calls in the file. What I do know is that the min/max values here need to chosen so that the HSPACE and VSPACE characters are included in the range, and in EBCDIC, \t
is 0x5, which was causing test failures with the old code.
OP2U(SLJIT_SUB | SLJIT_SET_LESS_EQUAL, TMP1, 0, SLJIT_IMM, 0x0d - 0x0a); | ||
#ifdef EBCDIC | ||
OP2U(SLJIT_SUB | SLJIT_SET_Z, TMP1, 0, SLJIT_IMM, CHAR_LF); | ||
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_EQUAL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be organized in some way to groups? Maybe a static const sljit_u8
bitset could be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. The ASCII code is basically if (input >= '\n' && input <= '\r')
, and I've added an EBCDIC version which doesn't assume that the characters are consecutive, in the form if (input == '\n' || input == '\v' || input == '\f' || input == '\r')
.
Would you really prefer a 32-byte bitset instead? It would be a bit of a pain to construct the bitset literal from the CHAR values. I'll happily do it if you say so.
I don't care about "performance", I only want to get the tests passing.
I know this was a large amount of work. However, I have doubts about the usefulness of EBCDIC support. It looks like it is a large maintenance burden (#if !ebcdic macros for example). On ascii systems, there is no point on using EBCDIC. I think EBCDIC is only used on some exotic IBM systems (IBM Z), where jit is not available anyway. In the past EBCDIC was maintained in a separate tree. While some changes are useful for the main code base, I would prefer to keep it that way. There was a guy who maintained the EBCDIC code before, but he disappeared, and nobody came afterwards. I would wait until somebody who actually uses it appears, and decide things afterwards. |
I believe that the EBCDIC code (excluding JIT) is used in the port to IBM's mainframe Z/OS system that is maintained by Ze'ev Atlas. I assume he is still around, though he hasn't posted for a while. Historically, the EBCDIC support code was originally supplied by a user who wanted it, back in 2003 (says the PCRE1 ChangeLog). |
Yes! I think I agree with almost everything you say. It really is almost useless. I wish I knew if anyone at all in the world is using PCRE2 on these systems.
I agree it's pretty painful. However, the current PR is simply paying back technical debt, and there shouldn't much ongoing cost after this is merged. For what it's worth, there are 19,000 test regexes in
Agreed. Except for the one fact that we need to test this code. I basically see the existing EBCDIC code as a big piece of tech debt: it's code that we are shipping in every release, and our README has claimed that it's fully supported. But we don't compile it, we don't test it, and don't even know if it works. In fact, as I found in this PR, we have quite a number of existing bugs. For me, I feel that this PR is forced on us. We must do something:
In the long term, we can't just carry on shipping something that we can't verify.
Yes, it's just IBM systems. They are still being sold, and are widely used in some industries still (eg finance, banking). I have never seen or used one of these systems myself. I expect JIT does work on these systems. They are fully UNIX-compliant, and should work with I feel bad about this PR. I hate to waste time on something dead. But, I'm not confident to simply delete all the EBCDIC code. Nor is it acceptable to keep shipping it forever, when we know (now) that it's bitrotted and buggy and isn't tested. |
Thank you for that explanation Philip.
I think we should actually delete the "native z/OS" code in PCRE2. And also the VMS code. Those users can keep a GitHub fork of PCRE2 with their code and do a "git merge" anytime they want to update to the latest release. Was Ze'ev's code even used in any production systems? Some of his messages make it sound like it was a side-project or hobby. Here's something which looks like it's actually used: https://github.com/zopencommunity/libpcre2port/tree/main It's a version of PCRE2 which is built and maintained by actual IBM employees, so that z/OS customers can use software like R, PHP, Julia etc which depend on PCRE2. |
It is unlikely. According to wikipedia ( https://en.wikipedia.org/wiki/IBM_Z ), they use Telum CPUs. I am sure JIT does not work on them, since they are not s390x / POWER compatible systems (why they maintain that many architectures?). Maybe discussing this with IBM first would be better. It is ok for me if we drop EBCDIC support as well. |
You think that Telum is not s390x? I thought that all that S390x stuff was precisely to support the IBM Z. I think "Telum" is maybe a brand name (like "Pentium") and it uses the ISA "z/Architecture". According to this Qemu their s390x supports all the IBM Z mainframe CPUs. I don't know anything for sure though.
Good idea. I can always ask. I'm cautious of doing extra work however :( |
If Z systems would be s390x, then jit would support it, and they would encounter all kinds of errors, because their character set is not supported. Or perhaps they never tried to enable the jit support, then they probably don't need it at all. I got access to s390x hardware from IBM, but it runs a standard Linux virtual machine, no EBCDIC. We can definitely talk to them and get a clearer picture. |
The hardware isn't EBCDIC or ASCII, it's just hardware. The OS that runs on it has a convention for whether files are to be interpreted as EBCDIC or ASCII - but it's just a convention. Bytes are bytes. You can read EBCDIC data on an "ASCII" OS. The only thing that changes is what encoding is expected by system utilities which read and write data from files or streams, and what numeric values the compiler assigns to C character literals. Linux is always ASCII - even on IBM Z hardware. I'll ask the IBM people whether they use EBCDIC. |
You could also ask Ze'ev about his port to "native Z/OS" which he implied, when he first did it, was wanted by some people who ran "native Z/OS" without the Linux compatibility features. There are perhaps some "hard core" users of this type? Back in the 70's and 80's I did work on such systems, so I have a vague memory of that environment. |
Part of #655
We currently don't have any testing in CI whatsoever for the EBCDIC code. I know it's obscure, but I feel it ought be tested at least a little before we put out a release.
I'm making it so that you can now compile a version of the PCRE2 libraries which fully-support EBCDIC, on any platform, with any compiler.
This lets us now write some proper EBCDIC tests and run them in CI.