Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some real EBCDIC testing #656

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

NWilson
Copy link
Member

@NWilson NWilson commented Jan 7, 2025

Part of #655

We currently don't have any testing in CI whatsoever for the EBCDIC code. I know it's obscure, but I feel it ought be tested at least a little before we put out a release.

I'm making it so that you can now compile a version of the PCRE2 libraries which fully-support EBCDIC, on any platform, with any compiler.

This lets us now write some proper EBCDIC tests and run them in CI.

@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch 2 times, most recently from fb1c448 to 4b9f908 Compare January 7, 2025 17:29
@NWilson
Copy link
Member Author

NWilson commented Jan 13, 2025

I've been undecided about what's the best test suite to run on the EBCDIC build.

Here's my decision: I want to run the full test suite (1 to 27), but obviously there will be some failures. I'll introduce a new "#ifndef EBCDIC" directive in the test input/output files, to mask out the specific tests that fail in EBCDIC mode. These will be ones of the form does [\x10-\x60] match 'A' and similar things.

I basically don't want to add a new "testoutputN-EBC" for every testoutput file (too verbose when writing tests). Nor do I want to just skip all the existing tests. Nor can we somehow maintain a list of "expected failures" under EBCDIC.

I'll also rename the existing "testinputEBC" to "testinput28" and just make it conditional the EBCDIC build, in the same way that lots of existing tests are conditional on build properties.

@zherczeg
Copy link
Collaborator

Is it possible to really test EBCDIC on ASCII? I hope this will not make our life difficult.

@NWilson
Copy link
Member Author

NWilson commented Jan 13, 2025

Yes, it's totally possible. It's in the PR currently, I just haven't updated the RunTest script to get all the tests passing.

It's also necessary to test the EBCDIC code - I just don't feel comfortable shipping lines of code that we haven't compiled or tested.

@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch 2 times, most recently from 06219fc to fdeb41f Compare January 13, 2025 22:13
@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch from f9ca812 to aea53cd Compare January 16, 2025 12:37
@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch from 71351f4 to ec9fdc8 Compare January 16, 2025 17:27
@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch from 2bc730f to 8613839 Compare January 17, 2025 16:19
@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch from b1d55a2 to 775385d Compare January 17, 2025 16:45
@NWilson NWilson marked this pull request as ready for review January 17, 2025 16:47
@NWilson NWilson requested a review from zherczeg February 5, 2025 17:45
@NWilson
Copy link
Member Author

NWilson commented Feb 5, 2025

@zherczeg I'm proud that I managed to make some changes to the JIT code in this PR!

I reckon the EBCDIC code has never been tested well. I wonder if anyone uses it at all :(

I have run the full test suite against it in this PR, with and without JIT, and uncovered numerous bugs.

This PR is huge - please be selective in what you review, and you don't need to look at every file! But I would appreciate a look at the JIT changes in particular, and maybe the small bugfixes I've done on the core compile.c and interpreter functions.

There is lots of spam in the testdata, and files like pcre2test.c. No need to look if you don't want to.

@NWilson NWilson force-pushed the user/niwilson/ebcdic-test branch from 7d4555a to da2a2f8 Compare February 5, 2025 17:50
@@ -1280,8 +1280,23 @@ while (TRUE)
value of 1 removes vertical space and 2 removes underscore. */

if (tabopt < 0) tabopt = -tabopt;
#ifdef EBCDIC
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indentation for braces.

@@ -188,7 +216,7 @@ while (plength > 0)
switch (posix_state)
{
case POSIX_CLASS_STARTED:
if (c <= 127 && islower(c)) break; /* Remain in started state */
if (ISLOWER(c)) break; /* Remain in started state */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the comparison with 127 is present, because it is compatible with utf8. Changing it to 256 might have side effects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ISLOWER macro is explicit above. The check for <127 is only needed if calling the C function islower(). In EBCDIC, the Latin alphabet occupies values >127.

@@ -1966,9 +1966,9 @@ switch(type)
detect_partial_match(common, backtracks);

if (type == OP_NOT_HSPACE)
read_char(common, 0x9, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR);
read_char(common, 0x1, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be changed to 0x0, since one byte utf8 characters may match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure why there was a min/max at all. I change it to min=0x1, max=0x3000 to match some other calls in the file. What I do know is that the min/max values here need to chosen so that the HSPACE and VSPACE characters are included in the range, and in EBCDIC, \t is 0x5, which was causing test failures with the old code.

OP2U(SLJIT_SUB | SLJIT_SET_LESS_EQUAL, TMP1, 0, SLJIT_IMM, 0x0d - 0x0a);
#ifdef EBCDIC
OP2U(SLJIT_SUB | SLJIT_SET_Z, TMP1, 0, SLJIT_IMM, CHAR_LF);
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_EQUAL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be organized in some way to groups? Maybe a static const sljit_u8 bitset could be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The ASCII code is basically if (input >= '\n' && input <= '\r'), and I've added an EBCDIC version which doesn't assume that the characters are consecutive, in the form if (input == '\n' || input == '\v' || input == '\f' || input == '\r').

Would you really prefer a 32-byte bitset instead? It would be a bit of a pain to construct the bitset literal from the CHAR values. I'll happily do it if you say so.

I don't care about "performance", I only want to get the tests passing.

@zherczeg
Copy link
Collaborator

zherczeg commented Feb 6, 2025

I know this was a large amount of work. However, I have doubts about the usefulness of EBCDIC support. It looks like it is a large maintenance burden (#if !ebcdic macros for example). On ascii systems, there is no point on using EBCDIC. I think EBCDIC is only used on some exotic IBM systems (IBM Z), where jit is not available anyway.

In the past EBCDIC was maintained in a separate tree. While some changes are useful for the main code base, I would prefer to keep it that way. There was a guy who maintained the EBCDIC code before, but he disappeared, and nobody came afterwards. I would wait until somebody who actually uses it appears, and decide things afterwards.

@PhilipHazel
Copy link
Collaborator

I believe that the EBCDIC code (excluding JIT) is used in the port to IBM's mainframe Z/OS system that is maintained by Ze'ev Atlas. I assume he is still around, though he hasn't posted for a while. Historically, the EBCDIC support code was originally supplied by a user who wanted it, back in 2003 (says the PCRE1 ChangeLog).

@NWilson
Copy link
Member Author

NWilson commented Feb 6, 2025

However, I have doubts about the usefulness of EBCDIC support.

Yes! I think I agree with almost everything you say. It really is almost useless. I wish I knew if anyone at all in the world is using PCRE2 on these systems.

It looks like it is a large maintenance burden (#if !ebcdic macros for example).

I agree it's pretty painful. However, the current PR is simply paying back technical debt, and there shouldn't much ongoing cost after this is merged. For what it's worth, there are 19,000 test regexes in testdata/, and this PR adds 112 #if exclusions for EBCDIC (0.5%). Once we've committed these 112 test exclusions, we can forget about EBCDIC and get back to normal life.

On ascii systems, there is no point on using EBCDIC.

Agreed. Except for the one fact that we need to test this code.

I basically see the existing EBCDIC code as a big piece of tech debt: it's code that we are shipping in every release, and our README has claimed that it's fully supported. But we don't compile it, we don't test it, and don't even know if it works. In fact, as I found in this PR, we have quite a number of existing bugs.

For me, I feel that this PR is forced on us. We must do something:

  • Delete the EBCDIC code, and stop claiming we support it
  • Or, run the unit tests, and fix the bugs we find.

In the long term, we can't just carry on shipping something that we can't verify.

I think EBCDIC is only used on some exotic IBM systems (IBM Z), where jit is not available anyway.

Yes, it's just IBM systems. They are still being sold, and are widely used in some industries still (eg finance, banking). I have never seen or used one of these systems myself. I expect JIT does work on these systems. They are fully UNIX-compliant, and should work with ./configure && make just like any other UNIX. (Ze'ev's port to "native z/OS" is really really weird. I believe no-one is using PCRE2 in that way. It's a complete distraction.)


I feel bad about this PR. I hate to waste time on something dead. But, I'm not confident to simply delete all the EBCDIC code. Nor is it acceptable to keep shipping it forever, when we know (now) that it's bitrotted and buggy and isn't tested.

@NWilson
Copy link
Member Author

NWilson commented Feb 6, 2025

Thank you for that explanation Philip.

I believe that the EBCDIC code (excluding JIT) is used in the port to IBM's mainframe Z/OS system that is maintained by Ze'ev Atlas.

I think we should actually delete the "native z/OS" code in PCRE2. And also the VMS code. Those users can keep a GitHub fork of PCRE2 with their code and do a "git merge" anytime they want to update to the latest release. Was Ze'ev's code even used in any production systems? Some of his messages make it sound like it was a side-project or hobby.

Here's something which looks like it's actually used: https://github.com/zopencommunity/libpcre2port/tree/main

It's a version of PCRE2 which is built and maintained by actual IBM employees, so that z/OS customers can use software like R, PHP, Julia etc which depend on PCRE2.

@zherczeg
Copy link
Collaborator

zherczeg commented Feb 6, 2025

I expect JIT does work on these systems.

It is unlikely. According to wikipedia ( https://en.wikipedia.org/wiki/IBM_Z ), they use Telum CPUs. I am sure JIT does not work on them, since they are not s390x / POWER compatible systems (why they maintain that many architectures?).

Maybe discussing this with IBM first would be better.

It is ok for me if we drop EBCDIC support as well.

@NWilson
Copy link
Member Author

NWilson commented Feb 6, 2025

It is unlikely. According to wikipedia ( https://en.wikipedia.org/wiki/IBM_Z ), they use Telum CPUs. I am sure JIT does not work on them, since they are not s390x / POWER compatible systems (why they maintain that many architectures?).

You think that Telum is not s390x? I thought that all that S390x stuff was precisely to support the IBM Z. I think "Telum" is maybe a brand name (like "Pentium") and it uses the ISA "z/Architecture". According to this Qemu their s390x supports all the IBM Z mainframe CPUs.

I don't know anything for sure though.

Maybe discussing this with IBM first would be better.

Good idea. I can always ask. I'm cautious of doing extra work however :(

@zherczeg
Copy link
Collaborator

zherczeg commented Feb 6, 2025

If Z systems would be s390x, then jit would support it, and they would encounter all kinds of errors, because their character set is not supported. Or perhaps they never tried to enable the jit support, then they probably don't need it at all. I got access to s390x hardware from IBM, but it runs a standard Linux virtual machine, no EBCDIC. We can definitely talk to them and get a clearer picture.

@NWilson
Copy link
Member Author

NWilson commented Feb 6, 2025

The hardware isn't EBCDIC or ASCII, it's just hardware. The OS that runs on it has a convention for whether files are to be interpreted as EBCDIC or ASCII - but it's just a convention. Bytes are bytes. You can read EBCDIC data on an "ASCII" OS. The only thing that changes is what encoding is expected by system utilities which read and write data from files or streams, and what numeric values the compiler assigns to C character literals.

Linux is always ASCII - even on IBM Z hardware.

I'll ask the IBM people whether they use EBCDIC.

@PhilipHazel
Copy link
Collaborator

You could also ask Ze'ev about his port to "native Z/OS" which he implied, when he first did it, was wanted by some people who ran "native Z/OS" without the Linux compatibility features. There are perhaps some "hard core" users of this type? Back in the 70's and 80's I did work on such systems, so I have a vague memory of that environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants