Move character lists data before the byte code in a pattern #540

zherczeg · 2024-10-27T17:30:57Z

This patch moves the character lists to the end of the pattern, and only a reference to them is stored in the byte code. This way repeating, repeat detection, and serialization work without modification. It can also save a lot of space when character sets inside brackets are repeated. The disadvantage is that an extra variable has to be maintained, which has a very low overhead, but still an overhead. My first idea was to store pointers, but that is not serialization friendly, although it would not be hard to implement.

zherczeg · 2024-10-27T17:32:35Z

Btw the original problem was that repeating character classes can break alignment rules unfortunately. It would not be bad if (PATTERN){n,m} could be implemented without repeating the (PATTERN).

zherczeg · 2024-10-28T09:44:21Z

I tried several variations of this code, but the engine always had a feature that blocked them. In the end the character list data is stored before the byte code. The cost is an extra argument for xclass, and an extra member in the real code. Computing the byte code start is simplified.

zherczeg · 2024-10-28T09:52:41Z

Context: a fuzzer reported a comparison of uninitialized memory when a subpattern is repeated. The reason is that the space reserved for alignment was not initialized. Then it turned out that repeating character lists may break alignment, so they need to be moved out of the byte code and stored elsewhere, with only a reference to them stored in the byte code, which can be repeated. So this patch fixes a complicated and serious error in the new code.
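To illustrate why repeating inline, padded data can break alignment, here is a minimal sketch. The helper names and the byte offsets are hypothetical and are not taken from the PCRE2 sources; the sketch only shows the arithmetic, assuming offsets are counted in bytes from a 4-byte-aligned base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of padding bytes needed so that `offset`
   becomes a multiple of `align` (align must be a power of two). */
static size_t padding_for(size_t offset, size_t align)
{
assert(align != 0 && (align & (align - 1)) == 0);
return (align - (offset & (align - 1))) & (align - 1);
}

/* Hypothetical helper: returns 1 if data stored `data_offset` bytes into
   a group is 4-byte aligned when the group starts at `group_start`. */
static int is_aligned_at(size_t group_start, size_t data_offset)
{
return ((group_start + data_offset) & 3) == 0;
}
```

For example, if a group starts at offset 6 and its inline data would begin at offset 9, `padding_for(9, 4)` gives 3 bytes of padding, so the data sits 6 bytes into the group and `is_aligned_at(6, 6)` holds. If the same bytes are copied to start at offset 20, `is_aligned_at(20, 6)` fails, because the padding was computed for the original position only.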

NWilson · 2024-10-28T11:43:29Z

Interesting. I would have thought the simplest thing would simply be to avoid adding alignment padding entirely, and simply have READ8, READ16, READ32 macros to do the unaligned reads using bit-ops. Just let the uint32 values lie wherever they fall naturally in the bytecode, and read them in a safe way.

Compilers are also super-good at optimising this pattern: uint32_t value; memcpy(&value, unaligned_address, 4). It's a good alternative to #define READ32(addr) (((uint32_t)(addr)[0] << 24) | ((uint32_t)(addr)[1] << 16) | ((uint32_t)(addr)[2] << 8) | (uint32_t)(addr)[3]).
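The two approaches mentioned above can be sketched as follows. The function names are hypothetical, and this is not code from the patch. Note that the two are not interchangeable: the memcpy version reads the value in the host's native byte order, while the shift version always reads big-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned read in native byte order. memcpy into a local avoids the
   undefined behaviour of dereferencing a misaligned uint32_t pointer. */
static uint32_t read32_memcpy(const uint8_t *addr)
{
uint32_t value;
assert(addr != NULL);
memcpy(&value, addr, sizeof value);
return value;
}

/* Unaligned big-endian read using shifts. Each byte is cast to uint32_t
   first so that the shift by 24 cannot overflow a signed int. */
static uint32_t read32_be(const uint8_t *addr)
{
assert(addr != NULL);
return ((uint32_t)addr[0] << 24) | ((uint32_t)addr[1] << 16) |
       ((uint32_t)addr[2] << 8) | (uint32_t)addr[3];
}
```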

NWilson · 2024-10-28T11:47:23Z

It's extra complexity to move data out of the main sequence of bytecode. And it's extra complexity to introduce any weird requirements on the bytecode, that would prevent it being memmove'd around. The bytecode is a PCRE2_SPTR, so its contents should be read and used, simply assuming that it has the PCRE2_SPTR base alignment. Adding padding so that certain slices inside it have stricter alignment then means that the code can't be copied and moved, which is unfortunate.

zherczeg · 2024-10-28T12:17:13Z

The bytecode header structure always needed at least 32 bit alignment, so you cannot move the byte code to any location. Unaligned accesses might be costly.

carenas · 2024-10-28T16:17:58Z

src/pcre2_compile.c

-        /* Char lists size is an even number,
-        because all items are 16 or 32 bit values. */
+        /* Char lists size is an even number, because all items are 16 or 32
+        bit values. The character list data is always aligned 32 bytes. */


s/bytes/bits/

carenas · 2024-10-28T20:08:06Z

src/pcre2_compile.c

+  CU2BYTES((PCRE2_SIZE)cb.names_found * (PCRE2_SIZE)cb.name_entry_size);
+
+#if defined SUPPORT_WIDE_CHARS
+if (cb.char_lists_size > 0)


not really making a difference but I think cb.char_lists_size != 0 is clearer

zherczeg · 2024-10-28T21:30:18Z

I have updated the patch

NWilson · 2024-10-29T10:18:46Z

This will probably make a merge conflict with my PR #523, so it's a race to see which is merged first!

zherczeg · 2024-10-29T10:24:02Z

@PhilipHazel will decide. I have no problem with doing the merge.

NWilson · 2024-10-29T10:26:14Z

src/pcre2_compile.c

+
+          cb->char_lists_size += char_lists_size;
+
+          memcpy((uint8_t*)cb->start_code - cb->char_lists_size,


Interesting, are the character lists laid out in backwards order? The first ones encountered in the pattern are just before the start of the code, and the character class data at the end of the pattern is placed at the start?

Or am I misunderstanding what this does? (It seems to be packing the character data into the same alloced buffer as the code, but stored in front of it.)

Yes. This way we can reuse the code start value (part of the match block) as the base (list end) value. I could reverse the order (start from max to 0), but I don't see any gain from it. This way the generator is a bit less complex.

OK. Maybe a comment on the layout would help.
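The layout discussed in this thread can be sketched roughly as below. This is a simplified illustration, not the code from the patch: the struct, the function names, and the idea of returning the distance back from the code start as the reference are all assumptions made for the example. The caller is assumed to have reserved enough space in front of `start_code`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two compile block fields used above. */
typedef struct {
  uint8_t *start_code;     /* first byte of the byte code */
  size_t char_lists_size;  /* bytes of list data stored so far */
} sketch_block;

/* Copies one list so that it ends where the previous list begins, which
   means lists grow backwards from start_code. Returns the distance from
   start_code back to the first byte of the new list. */
static size_t store_list(sketch_block *cb, const void *list, size_t size)
{
cb->char_lists_size += size;
memcpy(cb->start_code - cb->char_lists_size, list, size);
return cb->char_lists_size;
}

/* Resolves a reference returned by store_list() to the list data. */
static const uint8_t *list_at(const sketch_block *cb, size_t ref)
{
assert(ref != 0 && ref <= cb->char_lists_size);
return cb->start_code - ref;
}
```

With this arrangement the first list encountered sits directly in front of the byte code, and later lists sit at lower addresses, which matches the backwards order described above.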

NWilson · 2024-10-29T10:39:44Z

src/pcre2_compile.c

+
+          cb->char_lists_size += char_lists_size;
+
+          memcpy((uint8_t*)cb->start_code - cb->char_lists_size,
            (uint8_t*)(cranges + 1) + cranges->char_lists_start,
            char_lists_size);


You could memset the padding bytes to 0xff or something. It appears you're just leaving whatever junk is in there from malloc?

If you do initialize... then you'll get predictable runtime behaviour (if something reads from it by mistake). If you don't initialize... then you'll get nice valgrind warnings.

Make tools happy. Ok, I will check.

The best-of-both would be a memset to 0xff and also an ifdef SUPPORT_VALGRIND ... VALGRIND_MAKE_MEM_NOACCESS.

I prefer doing only the VALGRIND_MAKE_MEM_NOACCESS, so any mistaken reads can be detected. After all, those valgrind warnings are a feature! The memset is just a workaround IMHO.

I have added a ((uint16_t*)data)[-1] = 0xffff; to set all bits. Btw valgrind only complains if you check the memory. Just copying it is not a problem.

Just to check I understand - the data is 16-bit and/or 32-bit packed values, but we want it to be 32-bit aligned. So the data variable is always 16-bit aligned (at least), and if there is padding, it will be exactly 2 bytes. So the cast (uint16_t*)data is doing a correctly-aligned write of the two padding bytes.

That all looks great, thanks!

Yes. The character list data is a stream of 16 bit values followed by a stream of 32 bit values. Both streams can be empty. Regardless, the final data is considered to be a stream of 32 bit values, and there might sometimes be an unused 16 bit value at the beginning.
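The padding rule described here can be sketched as follows. The function names are hypothetical, and the sketch assumes that the only padding ever needed is a single unused 16-bit slot in front of an odd number of 16-bit items, as the discussion above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of leading padding for a list holding count16 16-bit items
   followed by 32-bit items: 2 when count16 is odd, otherwise 0. */
static size_t leading_padding(size_t count16)
{
return (count16 & 1) * sizeof(uint16_t);
}

/* Total size in bytes of the padded list, always a multiple of 4. */
static size_t padded_list_size(size_t count16, size_t count32)
{
size_t size = leading_padding(count16) + count16 * sizeof(uint16_t)
  + count32 * sizeof(uint32_t);
assert((size & 3) == 0);
return size;
}

/* Sets all bits of the unused slot directly in front of the list, if
   there is one. `data` points at the first 16-bit item, and the slot
   in front of it must belong to the same buffer. */
static void fill_padding(uint16_t *data, size_t count16)
{
if (leading_padding(count16) != 0) data[-1] = 0xffff;
}
```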

carenas · 2024-10-29T12:23:53Z

src/pcre2_compile.c

-            char_buffer += 2;
-            }
+            cb->char_lists_size += 2;
+            /* Make tools happy by setting memory data. */


s/memory data/memory padding data/

zherczeg · 2024-10-29T13:39:47Z

I changed the extra 16 bit setting to debug only, and added some asserts.

NWilson

Great, this looks like it should work well

NWilson · 2024-10-30T11:45:06Z

src/pcre2_compile.c

@@ -10206,13 +10206,36 @@ if (length > MAX_PATTERN_SIZE)
  goto HAD_CB_ERROR;
  }


This block here could be in an "#else" block for the code you added.

Sorry, really minor nit.

I am not sure I follow. I wanted to be on the safe side here, since MAX_PATTERN_SIZE - length >= 0 after the check and never underflows.

I see, you want to avoid overflow. OK, that makes sense. Your latest version merges the two blocks, which looks tidy. Thanks!
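The overflow-avoiding check discussed in this thread follows a common pattern, sketched below. The limit and the function name are hypothetical, and this is not the code from the patch: the point is only that the remaining room is computed by subtraction after the length has been checked, so the sum is never formed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SIZE ((size_t)65536)  /* hypothetical limit */

/* Returns 1 when length + extra fits within SKETCH_MAX_SIZE. The sum is
   never computed, so it cannot wrap around for large inputs. */
static int fits_within_limit(size_t length, size_t extra)
{
if (length > SKETCH_MAX_SIZE) return 0;
/* After the check above the subtraction cannot underflow. */
assert(SKETCH_MAX_SIZE - length <= SKETCH_MAX_SIZE);
return extra <= SKETCH_MAX_SIZE - length;
}
```

A naive `length + extra > SKETCH_MAX_SIZE` test would wrongly accept a huge `extra` whose sum with `length` wraps around to a small value.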

This ensures aligned data store even when the range is repeated. Furthermore character lists are stored once regardless of repeats.

zherczeg force-pushed the pointer_to_char_list branch 3 times, most recently from d29e3ec to 8faa069 Compare October 28, 2024 09:41

zherczeg changed the title ~~Move character lists to the end of the pattern~~ Move character lists data before the byte code in a pattern Oct 28, 2024

carenas approved these changes Oct 28, 2024

zherczeg force-pushed the pointer_to_char_list branch from 8faa069 to 2ae56b2 Compare October 28, 2024 21:29

NWilson reviewed Oct 29, 2024

zherczeg force-pushed the pointer_to_char_list branch 2 times, most recently from f80bd56 to bf0e3e9 Compare October 29, 2024 12:07

carenas reviewed Oct 29, 2024

zherczeg force-pushed the pointer_to_char_list branch from bf0e3e9 to 876589f Compare October 29, 2024 13:37

zherczeg force-pushed the pointer_to_char_list branch 2 times, most recently from a7592b4 to ee034d2 Compare October 30, 2024 10:57

NWilson approved these changes Oct 30, 2024

Move character lists data before the byte code in a pattern

0c67546

This ensures aligned data store even when the range is repeated. Furthermore character lists are stored once regardless of repeats.

zherczeg force-pushed the pointer_to_char_list branch from ee034d2 to 0c67546 Compare October 30, 2024 13:03

zherczeg merged commit 24f9d8d into PCRE2Project:master Oct 31, 2024
15 checks passed

zherczeg deleted the pointer_to_char_list branch October 31, 2024 07:29
