Single compilation pass over METAs (wild idea) #587

Open
NWilson opened this issue Dec 2, 2024 · 4 comments
Labels
untidiness Not exactly a bug, but could do better
Comments

@NWilson
Member

NWilson commented Dec 2, 2024

At the moment, there's a pre-compilation pass of chars into METAs.

Then, there are two passes over the METAs, once to find the length of the buffer, then secondly to write into the allocated buffer.
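For readers unfamiliar with the pattern, here is a minimal sketch of "measure, allocate, then write". It is not PCRE2's actual code: the encoding (one byte for small items, three for large ones) and the names `emit_all` and `compile_two_pass` are invented for illustration. The `out != NULL` checks play the role that the `lengthptr != NULL` checks play in the real compiler.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical encoder: a META item below 256 takes one byte, anything
   larger takes three. When out is NULL nothing is written and only the
   length is computed (the "length-only" pass). */
size_t emit_all(const unsigned *meta, size_t count, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < count; i++) {
        if (meta[i] < 256u) {
            if (out != NULL) out[len] = (unsigned char)meta[i];
            len += 1;
        } else {
            if (out != NULL) {
                out[len] = 0xFFu;  /* escape marker (assumption) */
                out[len + 1] = (unsigned char)((meta[i] >> 8) & 0xFFu);
                out[len + 2] = (unsigned char)(meta[i] & 0xFFu);
            }
            len += 3;
        }
    }
    return len;
}

/* Pass 1 measures, one exact-size malloc follows, pass 2 writes. */
unsigned char *compile_two_pass(const unsigned *meta, size_t count,
                                size_t *out_len)
{
    size_t len = emit_all(meta, count, NULL);
    unsigned char *buf = malloc(len ? len : 1);
    if (buf == NULL) return NULL;
    *out_len = emit_all(meta, count, buf);
    return buf;
}
```

Every branch of the encoder has to stay in sync between "count only" and "really write", which is the duplication the proposal below would remove.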

Can we just combine those? We'd need to have a buffer allocation strategy:

  • Some heuristics for initial size (which could be guessed reasonably accurately from the META pass)
  • Then if we need more buffer space, we can realloc to 1.5× (for example). A bit like appending to a dynamic container: resize with geometric growth
  • Finally, if there's any wastage at the end (e.g. we ended up allocating 350 bytes but only used 300) then we could realloc down to release the unused space. Or just accept it as OK, if the overestimate was small.
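The three steps above could be sketched roughly as follows. This is only an illustration using the standard `malloc`/`realloc`, not a proposed patch: the `code_buf` type, the function names, and the slack `tolerance` parameter are all hypothetical, and overflow checks on the size arithmetic are omitted for brevity.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;
    size_t used;   /* bytes written so far */
    size_t cap;    /* bytes allocated */
} code_buf;

/* Start from a guessed size, e.g. derived from the META count. */
int code_buf_init(code_buf *b, size_t guess)
{
    b->cap = guess ? guess : 16;
    b->used = 0;
    b->data = malloc(b->cap);
    return b->data != NULL;
}

/* Append n bytes, growing by roughly 1.5x when space runs out. */
int code_buf_append(code_buf *b, const void *src, size_t n)
{
    if (n > b->cap - b->used) {
        size_t need = b->used + n;
        size_t new_cap = b->cap + b->cap / 2 + 1;
        if (new_cap < need) new_cap = need;
        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL) return 0;  /* the old buffer is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return 1;
}

/* Give back the unused tail, but only if it exceeds the tolerance. */
void code_buf_trim(code_buf *b, size_t tolerance)
{
    if (b->used > 0 && b->cap - b->used > tolerance) {
        unsigned char *p = realloc(b->data, b->used);
        if (p != NULL) {
            b->data = p;
            b->cap = b->used;
        }
    }
}
```

With the 350/300 example, `code_buf_trim(&b, 64)` would leave the buffer alone, while `code_buf_trim(&b, 0)` would shrink it to 300 bytes.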

The upside would be: simpler code! And faster for most users (since only one pass is needed). This is assuming that the change from 1×malloc + two-pass compilation to 1×malloc + 1×realloc + single pass is actually an improvement.

The downside would be marginally higher memory usage for users with many many regexes, but realloc'ing down to the correct size at the end should solve that.

Getting rid of all the lengthptr != NULL code would be really quite nice.

@zherczeg
Collaborator

zherczeg commented Dec 2, 2024

With reallocing, I never know whether it is good or bad. It probably depends on the allocator. Another option is more caching. For example, character ranges are cached during the lengthptr==null phase. Caching is just "pushing" in practice: since we walk the data twice in the same order, no "searching" is needed, because the reading order follows the creation order.
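To make the "pushing" point concrete, here is a small sketch of a cache that is filled in one pass and read back in the same order in the other. It is not taken from the PCRE2 sources: the `pass_cache` type, its fixed capacity, and the function names are assumptions made for illustration.

```c
#include <stddef.h>

#define CACHE_MAX 64  /* arbitrary capacity for the sketch */

typedef struct {
    size_t items[CACHE_MAX];
    size_t write_pos;  /* next free slot, advanced by the producing pass */
    size_t read_pos;   /* next unread slot, advanced by the consuming pass */
} pass_cache;

void cache_reset(pass_cache *c)
{
    c->write_pos = 0;
    c->read_pos = 0;
}

/* Producing pass: append a computed value. Returns 0 when full. */
int cache_push(pass_cache *c, size_t value)
{
    if (c->write_pos >= CACHE_MAX) return 0;
    c->items[c->write_pos++] = value;
    return 1;
}

/* Consuming pass: take the next value in creation order. No lookup key
   is needed because both passes visit the pattern in the same order. */
int cache_next(pass_cache *c, size_t *value)
{
    if (c->read_pos >= c->write_pos) return 0;
    *value = c->items[c->read_pos++];
    return 1;
}
```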

Overall, experimenting with other methods, and proving they are better is a resource consuming process.

@PhilipHazel
Collaborator

Unfortunately, I screwed up when I designed the PCRE2 API in that the custom allocator interface has only alloc and free entries. There is no support for re-alloc. In any case, I would hope that considerations of this sort might be postponed till we manage to get 10.45 (and possibly 46, 47, ... because no doubt there will be issues after all the big changes) out of the door.
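For illustration, a realloc-style grow can be emulated on top of an alloc/free-only interface by allocating a new block, copying, and freeing the old one. The sketch below assumes a hypothetical `mem_hooks` struct that mirrors the shape of a pair of callbacks plus a user-data pointer. It does not use the PCRE2 headers, and `hooks_regrow` is an invented name.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hooks: only alloc and free, no realloc. */
typedef struct {
    void *(*alloc)(size_t size, void *user_data);
    void (*release)(void *ptr, void *user_data);
    void *user_data;
} mem_hooks;

/* Emulated realloc. The caller must pass old_size, because the hooks
   cannot report how large an existing block is. */
void *hooks_regrow(const mem_hooks *h, void *old, size_t old_size,
                   size_t new_size)
{
    void *p = h->alloc(new_size, h->user_data);
    if (p == NULL) return NULL;  /* the old block is left untouched */
    if (old != NULL) {
        memcpy(p, old, old_size < new_size ? old_size : new_size);
        h->release(old, h->user_data);
    }
    return p;
}

/* Default hooks backed by malloc/free, for demonstration only. */
void *default_alloc(size_t size, void *user_data)
{
    (void)user_data;
    return malloc(size);
}

void default_release(void *ptr, void *user_data)
{
    (void)user_data;
    free(ptr);
}
```

Unlike a real `realloc`, this can never extend a block in place: every grow pays for a full copy, and both blocks are live at the same moment, so peak memory is briefly higher.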

@NWilson
Member Author

NWilson commented Dec 2, 2024

Yes of course! No rush!

@zherczeg
Collaborator

zherczeg commented Dec 3, 2024

Usually these are the kinds of tasks that we do at the University for our partners.

@NWilson NWilson added the untidiness Not exactly a bug, but could do better label Dec 9, 2024
@NWilson NWilson added this to the Future milestone Jan 8, 2025