Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move character lists data before the byte code in a pattern #540

Merged
merged 1 commit into from
Oct 31, 2024

Conversation

zherczeg
Copy link
Collaborator

This patch moves the character lists to the end of the pattern, and only a reference is stored. This way repeating, repeat detection, and serializing works without modifications. Also, for repeating character sets inside brackets can save a lot of space. The disadvantage is that an extra variable is needed to be maintained, which has a very low overhead, but still an overhead. My first idea was storing pointers, but that is not serialization friendly, although not that hard to implement.

@zherczeg
Copy link
Collaborator Author

Btw the original problem was that repeating character classes can break alignment rules unfortunately. It would not be bad if (PATTERN){n,m} could be implemented without repeating the (PATTERN).

@zherczeg zherczeg force-pushed the pointer_to_char_list branch 3 times, most recently from d29e3ec to 8faa069 Compare October 28, 2024 09:41
@zherczeg zherczeg changed the title Move character lists to the end of the pattern Move character lists data before the byte code in a pattern Oct 28, 2024
@zherczeg
Copy link
Collaborator Author

I tried various variations for this code, but the engine always have a feature which blocked them. In the end the character list data is stored before the byte code. The cost is an extra argument for xclass, and an extra member in the real code. Computing the byte code start is simplified.

@zherczeg
Copy link
Collaborator Author

Context: a fuzzer got a comparing uninitialized memory error, when a subpattern is repeated. The reason is that space for alignemnt was not initialized. Then it turned out that repeating character lists may break alignment, so they need to be moved out from the byte code and stored elsewhere, and only a reference to them is stored in the byte code, which can be repeated. So this patch fixes a complicated and serious error in the new code.

@NWilson
Copy link
Member

NWilson commented Oct 28, 2024

Interesting. I would have thought the simplest thing would simply be to avoid adding alignment padding entirely, and simply have READ8, READ16, READ32 macros to do the unaligned reads using bit-ops. Just let the uint32 values lie wherever they fall naturally in the bytecode, and read them in a safe way.

Compilers are also super-good at optimising this pattern: uint32_t value; memcpy(&value, unaligned_address, 4). It's a good alternative to #define READ32(addr) (addr[0] << 24) | (addr[1] << 16) | (addr[2] << 8) | addr[3].

@NWilson
Copy link
Member

NWilson commented Oct 28, 2024

It's extra complexity to move data out of the main sequence of bytecode. And it's extra complexity to introduce any weird requirements on the bytecode, that would prevent it being memmove'd around. The bytecode is a PCRE2_SPTR, so its contents should be read and used, simply assuming that it has the PCRE2_SPTR base alignment. Adding padding so that certain slices inside it have stricter alignment then means that the code can't be copied and moved, which is unfortunate.

@zherczeg
Copy link
Collaborator Author

The bytecode header structure always needed at least 32 bit alignment, so you cannot move the byte code to any location. Unaligned accesses might be costly.

/* Char lists size is an even number,
because all items are 16 or 32 bit values. */
/* Char lists size is an even number, because all items are 16 or 32
bit values. The character list data is always aligned 32 bytes. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bytes/bits/

CU2BYTES((PCRE2_SIZE)cb.names_found * (PCRE2_SIZE)cb.name_entry_size);

#if defined SUPPORT_WIDE_CHARS
if (cb.char_lists_size > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really making a difference but I think cb.char_lists_size != 0 is clearer

@zherczeg zherczeg force-pushed the pointer_to_char_list branch from 8faa069 to 2ae56b2 Compare October 28, 2024 21:29
@zherczeg
Copy link
Collaborator Author

I have updated the patch

@NWilson
Copy link
Member

NWilson commented Oct 29, 2024

This will probably make a merge conflict with my PR #523, so it's a race to see which is merged first!

@zherczeg
Copy link
Collaborator Author

@PhilipHazel will decide. I have no problem with doing the merge.


cb->char_lists_size += char_lists_size;

memcpy((uint8_t*)cb->start_code - cb->char_lists_size,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, are the character lists laid out in backwards order? The first ones encountered in the pattern are just before the start of the code, and the character class data at the end of the pattern is placed at the start?

Or am I misunderstanding what this does? (It seems to be packing the character data into the same alloced buffer as the code, but stored in front of it.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This way we can reuse the code start value (part of the match block) as the base (list end) value. I could reverse the order (start from max to 0), but I don't see any gain from it. This way the generator is a bit less complex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Maybe a comment on the layout would help.


cb->char_lists_size += char_lists_size;

memcpy((uint8_t*)cb->start_code - cb->char_lists_size,
(uint8_t*)(cranges + 1) + cranges->char_lists_start,
char_lists_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could memset the padding bytes to 0xff or something. It appears you're just leaving whatever junk is in there from malloc?

If you do initialize... then you'll get predictable runtime behaviour (if something reads from it by mistake). If you don't initialize... then you'll get nice valgrind warnings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make tools happy. Ok, I will check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best-of-both would be a memset to 0xff and also an ifdef SUPPORT_VALGRIND ... VALGRIND_MAKE_MEM_NOACCESS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer doing only the VALGRIND_MAKE_MEM_NOACCESS, so any mistaken reads can be detected, after all those valgrind warnings are a feature!, the memset is just a workaround IMHO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a ((uint16_t*)data)[-1] = 0xffff; to set all bits. Btw valgrind only complains if you check the memory. Just copying it is not a problem.

Copy link
Member

@NWilson NWilson Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check I understand - the data is 16-bit and/or 32-bit packed values, but we want it to be 32-bit aligned. So the data variable is always 16-bit aligned (at least), and if there is padding, it will be exactly 2 bytes. So the cast (uint16_t*)data is doing a correctly-aligned write of the two padding bytes.

That all looks great, thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The character list data is a stream of 16 bit values followed by a stream of 32 bit values. Both stream can be empty. Regardless, the final data is considered to be a stream of 32 bit values, and there might be an unused 16 bit value in the beginning sometimes.

@zherczeg zherczeg force-pushed the pointer_to_char_list branch 2 times, most recently from f80bd56 to bf0e3e9 Compare October 29, 2024 12:07
char_buffer += 2;
}
cb->char_lists_size += 2;
/* Make tools happy by setting memory data. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/memory data/memory padding data/

@zherczeg zherczeg force-pushed the pointer_to_char_list branch from bf0e3e9 to 876589f Compare October 29, 2024 13:37
@zherczeg
Copy link
Collaborator Author

I changed the extra 16 bit setting to debug only, and added some asserts.

@zherczeg zherczeg force-pushed the pointer_to_char_list branch 2 times, most recently from a7592b4 to ee034d2 Compare October 30, 2024 10:57
Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, this look like it should work well

@@ -10206,13 +10206,36 @@ if (length > MAX_PATTERN_SIZE)
goto HAD_CB_ERROR;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block here could be in an "#else" block for the code you added.

Sorry, really minor nit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow. I wanted to be in the safe side here, since MAX_PATTERN_SIZE - length >= 0 after the check and never underflows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you want to avoid overflow. OK, that makes sense. Your latest version merges the two blocks, which looks tidy. Thanks!

This ensures aligned data store even when the range is repeated.
Furthermore character lists are stored once regerdless of repeats.
@zherczeg zherczeg force-pushed the pointer_to_char_list branch from ee034d2 to 0c67546 Compare October 30, 2024 13:03
@zherczeg zherczeg merged commit 24f9d8d into PCRE2Project:master Oct 31, 2024
15 checks passed
@zherczeg zherczeg deleted the pointer_to_char_list branch October 31, 2024 07:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants