Add some real EBCDIC testing #656

NWilson · 2025-01-07T11:30:58Z

Part of #655

We currently don't have any testing in CI whatsoever for the EBCDIC code. I know it's obscure, but I feel it ought to be tested at least a little before we put out a release.

I'm making it so that you can now compile a version of the PCRE2 libraries which fully supports EBCDIC, on any platform, with any compiler.

This lets us now write some proper EBCDIC tests and run them in CI.

NWilson · 2025-01-13T09:40:18Z

I've been undecided about what's the best test suite to run on the EBCDIC build.

Here's my decision: I want to run the full test suite (1 to 27), but obviously there will be some failures. I'll introduce a new "#ifndef EBCDIC" directive in the test input/output files, to mask out the specific tests that fail in EBCDIC mode. These will be tests of the form "does [\x10-\x60] match 'A'?" and similar.

I basically don't want to add a new "testoutputN-EBC" for every testoutput file (too verbose when writing tests). Nor do I want to just skip all the existing tests. Nor can we somehow maintain a list of "expected failures" under EBCDIC.

I'll also rename the existing "testinputEBC" to "testinput28" and just make it conditional on the EBCDIC build, in the same way that lots of existing tests are conditional on build properties.

zherczeg · 2025-01-13T09:58:53Z

Is it possible to really test EBCDIC on ASCII? I hope this will not make our life difficult.

NWilson · 2025-01-13T10:02:22Z

Yes, it's totally possible. It's in the PR currently; I just haven't updated the RunTest script to get all the tests passing.

It's also necessary to test the EBCDIC code - I just don't feel comfortable shipping lines of code that we haven't compiled or tested.

NWilson · 2025-02-05T17:48:36Z

@zherczeg I'm proud that I managed to make some changes to the JIT code in this PR!

I reckon the EBCDIC code has never been tested well. I wonder if anyone uses it at all :(

I have run the full test suite against it in this PR, with and without JIT, and uncovered numerous bugs.

This PR is huge - please be selective in what you review, and you don't need to look at every file! But I would appreciate a look at the JIT changes in particular, and maybe the small bugfixes I've done on the core compile.c and interpreter functions.

There is lots of spam in the testdata, and files like pcre2test.c. No need to look if you don't want to.

zherczeg · 2025-02-06T05:26:45Z

src/pcre2_compile_class.c

@@ -1280,8 +1280,23 @@ while (TRUE)
    value of 1 removes vertical space and 2 removes underscore. */

    if (tabopt < 0) tabopt = -tabopt;
+#ifdef EBCDIC
+    {


Wrong indentation for braces.

zherczeg · 2025-02-06T05:28:56Z

src/pcre2_convert.c

@@ -188,7 +216,7 @@ while (plength > 0)
      switch (posix_state)
        {
        case POSIX_CLASS_STARTED:
-        if (c <= 127 && islower(c)) break;  /* Remain in started state */
+        if (ISLOWER(c)) break;  /* Remain in started state */


Usually the comparison with 127 is present, because it is compatible with utf8. Changing it to 256 might have side effects.

NWilson · 2025-02-06

The ISLOWER macro is explicit above. The check for <= 127 is only needed if calling the C function islower(). In EBCDIC, the Latin alphabet occupies values > 127.
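To illustrate why the 127 guard cannot stay as it was, here is a minimal sketch. It assumes the usual EBCDIC layout, where the lowercase letters sit in three separate runs (0x81-0x89, 0x91-0x99, 0xA2-0xA9). The names `is_lower_ebcdic`, `is_lower_guarded` and `check_lower` are hypothetical; this is not the ISLOWER macro from the PR.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for an EBCDIC lowercase test; NOT PCRE2's ISLOWER.
   Assumes the common EBCDIC layout: a-i, j-r, s-z in three separate runs. */
static int is_lower_ebcdic(unsigned int c)
{
  return (c >= 0x81 && c <= 0x89) ||
         (c >= 0x91 && c <= 0x99) ||
         (c >= 0xA2 && c <= 0xA9);
}

/* The old style of check: guard with c <= 127 before calling islower(). */
static int is_lower_guarded(unsigned int c)
{
  return c <= 127 && islower((int)c) != 0;
}

/* A few sanity checks, written for an ASCII host. */
void check_lower(void)
{
  assert(is_lower_ebcdic(0x81));    /* 0x81 is 'a' in EBCDIC */
  assert(!is_lower_guarded(0x81));  /* but the 127 guard rejects it outright */
  assert(is_lower_guarded(0x61));   /* 0x61 is 'a' in ASCII */
  assert(!is_lower_ebcdic(0x8A));   /* gap between 'i' and 'j' is not a letter */
}
```

The guard is harmless for ASCII and UTF-8 input, but under EBCDIC it would reject every lowercase letter before the class test is even reached.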

zherczeg · 2025-02-06T05:31:34Z

src/pcre2_jit_char_inc.h

@@ -1966,9 +1966,9 @@ switch(type)
    detect_partial_match(common, backtracks);

  if (type == OP_NOT_HSPACE)
-    read_char(common, 0x9, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR);
+    read_char(common, 0x1, 0x3000, backtracks, READ_CHAR_UPDATE_STR_PTR);


Maybe this could be changed to 0x0, since one byte utf8 characters may match.

NWilson · 2025-02-06

I wasn't sure why there was a min/max at all. I changed it to min=0x1, max=0x3000 to match some other calls in the file. What I do know is that the min/max values here need to be chosen so that the HSPACE and VSPACE characters are included in the range, and in EBCDIC, \t is 0x5, which was causing test failures with the old code.
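The range requirement can be sketched as follows. This only models the "the range must include the character" condition described above; `range_covers` and `check_hspace_range` are hypothetical helpers and say nothing about what read_char() does internally in the JIT.

```c
#include <assert.h>

/* Hypothetical helper: does the inclusive range [min, max] cover c?
   This is NOT the JIT's read_char(); it only models the range condition. */
static int range_covers(unsigned int min, unsigned int max, unsigned int c)
{
  return c >= min && c <= max;
}

void check_hspace_range(void)
{
  const unsigned int ascii_tab  = 0x09;  /* \t in ASCII */
  const unsigned int ebcdic_tab = 0x05;  /* \t in EBCDIC, as noted above */

  assert(range_covers(0x9, 0x3000, ascii_tab));    /* old bounds: fine in ASCII */
  assert(!range_covers(0x9, 0x3000, ebcdic_tab));  /* old bounds: miss EBCDIC \t */
  assert(range_covers(0x1, 0x3000, ebcdic_tab));   /* new bounds: cover it */
}
```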

zherczeg · 2025-02-06T05:33:14Z

src/pcre2_jit_compile.c

-OP2U(SLJIT_SUB | SLJIT_SET_LESS_EQUAL, TMP1, 0, SLJIT_IMM, 0x0d - 0x0a);
+#ifdef EBCDIC
+OP2U(SLJIT_SUB | SLJIT_SET_Z, TMP1, 0, SLJIT_IMM, CHAR_LF);
+OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_EQUAL);


Can these be organized into groups in some way? Maybe a static const sljit_u8 bitset could be better.

NWilson · 2025-02-06

Hmm. The ASCII code is basically if (input >= '\n' && input <= '\r'), and I've added an EBCDIC version which doesn't assume that the characters are consecutive, in the form if (input == '\n' || input == '\v' || input == '\f' || input == '\r').

Would you really prefer a 32-byte bitset instead? It would be a bit of a pain to construct the bitset literal from the CHAR values. I'll happily do it if you say so.

I don't care about "performance", I only want to get the tests passing.
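For comparison, the three approaches discussed here can be sketched in plain C. This is only an illustration under assumptions, not the SLJIT code from the PR: the EBCDIC values are taken from common code pages (037/1047), where LF is 0x25 (some EBCDIC compilers map '\n' to NL, 0x15, instead), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed EBCDIC values for illustration (common code pages 037/1047).
   Some EBCDIC compilers map '\n' to NL (0x15) rather than LF (0x25). */
enum { E_VT = 0x0B, E_FF = 0x0C, E_CR = 0x0D, E_LF = 0x25 };

/* ASCII-style check: relies on LF..CR being consecutive (0x0A..0x0D). */
static int in_lf_cr_range(unsigned int c)
{
  return c - 0x0Au <= 0x0Du - 0x0Au;  /* unsigned wrap rejects c < 0x0A */
}

/* Explicit check: makes no assumption about the values being adjacent. */
static int is_newline_explicit(unsigned int c)
{
  return c == E_LF || c == E_VT || c == E_FF || c == E_CR;
}

/* Bitset alternative: 256 bits = 32 bytes, one bit per byte value. */
static uint8_t newline_bits[32];

static void build_newline_bits(void)
{
  static const uint8_t members[] = { E_LF, E_VT, E_FF, E_CR };
  for (unsigned int i = 0; i < sizeof(members); i++)
    newline_bits[members[i] >> 3] |= (uint8_t)(1u << (members[i] & 7));
}

static int is_newline_bitset(unsigned int c)
{
  return c < 256 && ((newline_bits[c >> 3] >> (c & 7)) & 1);
}

void check_newline_tests(void)
{
  build_newline_bits();
  /* The explicit test and the bitset agree on every byte value. */
  for (unsigned int c = 0; c < 256; c++)
    assert(is_newline_explicit(c) == is_newline_bitset(c));
  assert(!in_lf_cr_range(E_LF));  /* the range check misses EBCDIC LF */
  assert(in_lf_cr_range(0x0A));   /* but it works for ASCII LF */
}
```

Building the table at startup sidesteps writing a 32-byte literal by hand, at the cost of a table that is no longer `static const`.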

zherczeg · 2025-02-06T05:45:57Z

I know this was a large amount of work. However, I have doubts about the usefulness of EBCDIC support. It looks like it is a large maintenance burden (#if !ebcdic macros for example). On ASCII systems, there is no point in using EBCDIC. I think EBCDIC is only used on some exotic IBM systems (IBM Z), where JIT is not available anyway.

In the past EBCDIC was maintained in a separate tree. While some changes are useful for the main code base, I would prefer to keep it that way. There was a guy who maintained the EBCDIC code before, but he disappeared, and nobody came afterwards. I would wait until somebody who actually uses it appears, and decide things afterwards.

PhilipHazel · 2025-02-06T08:19:07Z

I believe that the EBCDIC code (excluding JIT) is used in the port to IBM's mainframe Z/OS system that is maintained by Ze'ev Atlas. I assume he is still around, though he hasn't posted for a while. Historically, the EBCDIC support code was originally supplied by a user who wanted it, back in 2003 (says the PCRE1 ChangeLog).

NWilson · 2025-02-06T10:17:24Z

> However, I have doubts about the usefulness of EBCDIC support.

Yes! I think I agree with almost everything you say. It really is almost useless. I wish I knew if anyone at all in the world is using PCRE2 on these systems.

> It looks like it is a large maintenance burden (#if !ebcdic macros for example).

I agree it's pretty painful. However, the current PR is simply paying back technical debt, and there shouldn't be much ongoing cost after this is merged. For what it's worth, there are 19,000 test regexes in testdata/, and this PR adds 112 #if exclusions for EBCDIC (roughly 0.6%). Once we've committed these 112 test exclusions, we can forget about EBCDIC and get back to normal life.

> On ASCII systems, there is no point in using EBCDIC.

Agreed. Except for the one fact that we need to test this code.

I basically see the existing EBCDIC code as a big piece of tech debt: it's code that we are shipping in every release, and our README has claimed that it's fully supported. But we don't compile it, we don't test it, and don't even know if it works. In fact, as I found in this PR, we have quite a number of existing bugs.

For me, I feel that this PR is forced on us. We must do something:

- Delete the EBCDIC code, and stop claiming we support it.
- Or, run the unit tests, and fix the bugs we find.

In the long term, we can't just carry on shipping something that we can't verify.

> I think EBCDIC is only used on some exotic IBM systems (IBM Z), where JIT is not available anyway.

Yes, it's just IBM systems. They are still being sold, and are still widely used in some industries (e.g. finance, banking). I have never seen or used one of these systems myself. I expect JIT does work on these systems. They are fully UNIX-compliant, and should work with ./configure && make just like any other UNIX. (Ze'ev's port to "native z/OS" is really really weird. I believe no-one is using PCRE2 in that way. It's a complete distraction.)

I feel bad about this PR. I hate to waste time on something dead. But I'm not confident enough to simply delete all the EBCDIC code. Nor is it acceptable to keep shipping it forever, when we know (now) that it's bitrotted, buggy, and untested.

NWilson · 2025-02-06T10:26:07Z

Thank you for that explanation Philip.

> I believe that the EBCDIC code (excluding JIT) is used in the port to IBM's mainframe Z/OS system that is maintained by Ze'ev Atlas.

I think we should actually delete the "native z/OS" code in PCRE2. And also the VMS code. Those users can keep a GitHub fork of PCRE2 with their code and do a "git merge" anytime they want to update to the latest release. Was Ze'ev's code even used in any production systems? Some of his messages make it sound like it was a side-project or hobby.

Here's something which looks like it's actually used: https://github.com/zopencommunity/libpcre2port/tree/main

It's a version of PCRE2 which is built and maintained by actual IBM employees, so that z/OS customers can use software like R, PHP, Julia etc which depend on PCRE2.

zherczeg · 2025-02-06T11:40:17Z

> I expect JIT does work on these systems.

It is unlikely. According to Wikipedia ( https://en.wikipedia.org/wiki/IBM_Z ), they use Telum CPUs. I am sure JIT does not work on them, since they are not s390x / POWER compatible systems (why do they maintain that many architectures?).

Maybe discussing this with IBM first would be better.

It is ok for me if we drop EBCDIC support as well.

NWilson · 2025-02-06T11:57:49Z

> It is unlikely. According to Wikipedia ( https://en.wikipedia.org/wiki/IBM_Z ), they use Telum CPUs. I am sure JIT does not work on them, since they are not s390x / POWER compatible systems (why do they maintain that many architectures?).

You think that Telum is not s390x? I thought that all that s390x stuff was precisely to support the IBM Z. I think "Telum" is maybe a brand name (like "Pentium") and it uses the ISA "z/Architecture". According to QEMU, their s390x target supports all the IBM Z mainframe CPUs.

I don't know anything for sure though.

> Maybe discussing this with IBM first would be better.

Good idea. I can always ask. I'm cautious of doing extra work however :(

zherczeg · 2025-02-06T12:45:28Z

If Z systems were s390x, then JIT would support them, and they would encounter all kinds of errors, because their character set is not supported. Or perhaps they never tried to enable JIT support, in which case they probably don't need it at all. I got access to s390x hardware from IBM, but it runs a standard Linux virtual machine, no EBCDIC. We can definitely talk to them and get a clearer picture.

NWilson · 2025-02-06T14:02:29Z

The hardware isn't EBCDIC or ASCII, it's just hardware. The OS that runs on it has a convention for whether files are to be interpreted as EBCDIC or ASCII - but it's just a convention. Bytes are bytes. You can read EBCDIC data on an "ASCII" OS. The only thing that changes is what encoding is expected by system utilities which read and write data from files or streams, and what numeric values the compiler assigns to C character literals.
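As a small illustration of "bytes are bytes", the sketch below decodes a few EBCDIC bytes on any host. It assumes the usual EBCDIC positions for A-Z (0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9) and 0-9 (0xF0-0xF9); `ebcdic_to_ascii` and `check_ebcdic_decode` are hypothetical names, not PCRE2 functions. It deliberately uses numeric constants rather than character literals, because the value of a literal like 'A' is exactly the thing the compiler decides.

```c
#include <assert.h>
#include <stddef.h>

/* Translate one EBCDIC byte to its ASCII value, for A-Z and 0-9 only.
   Anything else becomes 0x3F (ASCII '?'). Hypothetical helper. */
static unsigned char ebcdic_to_ascii(unsigned char b)
{
  if (b >= 0xC1 && b <= 0xC9) return (unsigned char)(0x41 + (b - 0xC1)); /* A-I */
  if (b >= 0xD1 && b <= 0xD9) return (unsigned char)(0x4A + (b - 0xD1)); /* J-R */
  if (b >= 0xE2 && b <= 0xE9) return (unsigned char)(0x53 + (b - 0xE2)); /* S-Z */
  if (b >= 0xF0 && b <= 0xF9) return (unsigned char)(0x30 + (b - 0xF0)); /* 0-9 */
  return 0x3F;
}

void check_ebcdic_decode(void)
{
  /* The text "PCRE2" as EBCDIC bytes, and the same text as ASCII bytes. */
  const unsigned char ebcdic[] = { 0xD7, 0xC3, 0xD9, 0xC5, 0xF2 };
  const unsigned char ascii[]  = { 0x50, 0x43, 0x52, 0x45, 0x32 };

  for (size_t i = 0; i < sizeof(ebcdic); i++)
    assert(ebcdic_to_ascii(ebcdic[i]) == ascii[i]);
}
```

Nothing here depends on the host OS or CPU; only the interpretation of the bytes changes.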

Linux is always ASCII - even on IBM Z hardware.

I'll ask the IBM people whether they use EBCDIC.

PhilipHazel · 2025-02-06T16:54:33Z

You could also ask Ze'ev about his port to "native Z/OS" which he implied, when he first did it, was wanted by some people who ran "native Z/OS" without the Linux compatibility features. There are perhaps some "hard core" users of this type? Back in the 70s and 80s I did work on such systems, so I have a vague memory of that environment.

NWilson force-pushed the user/niwilson/ebcdic-test branch 2 times, most recently from fb1c448 to 4b9f908 Compare January 7, 2025 17:29

NWilson force-pushed the user/niwilson/ebcdic-test branch 2 times, most recently from 06219fc to fdeb41f Compare January 13, 2025 22:13

Add real EBCDIC build and tests

aea53cd

NWilson force-pushed the user/niwilson/ebcdic-test branch from f9ca812 to aea53cd Compare January 16, 2025 12:37

Some further fixes

ec9fdc8

NWilson force-pushed the user/niwilson/ebcdic-test branch from 71351f4 to ec9fdc8 Compare January 16, 2025 17:27

Last-ish round of fixes

8613839

NWilson force-pushed the user/niwilson/ebcdic-test branch from 2bc730f to 8613839 Compare January 17, 2025 16:19

github-actions bot and others added 2 commits January 17, 2025 16:19

Sync autogenerated files #noupdate

2863c9e

Small build fixes

775385d

NWilson force-pushed the user/niwilson/ebcdic-test branch from b1d55a2 to 775385d Compare January 17, 2025 16:45

NWilson marked this pull request as ready for review January 17, 2025 16:47

NWilson and others added 4 commits January 17, 2025 19:32

Update escaping used in test 3

48dcded

Merge branch 'master' into user/niwilson/ebcdic-test

20d81e3

Fix line-wrapping in testoutput3 files

adb7c8a

Sync autogenerated files #noupdate

30b9169

NWilson requested a review from zherczeg February 5, 2025 17:45

NWilson added 2 commits February 5, 2025 17:50

Update README with new tests

87ef192

Merge remote-tracking branch 'origin/master' into user/niwilson/ebcdi…

da2a2f8

…c-test

NWilson force-pushed the user/niwilson/ebcdic-test branch from 7d4555a to da2a2f8 Compare February 5, 2025 17:50

zherczeg reviewed Feb 6, 2025

Fix indentation

3bd6eec

NWilson mentioned this pull request Feb 6, 2025

Coordination with upstream PCRE2 zopencommunity/libpcre2port#11

Open
