Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse bitcode incrementally #190

Open
RyanGlScott opened this issue May 23, 2022 · 1 comment
Open

Parse bitcode incrementally #190

RyanGlScott opened this issue May 23, 2022 · 1 comment

Comments

@RyanGlScott
Copy link
Contributor

Currently, parsing is whole-program: the only way to parse a bitcode file is to parse every entity inside it. For most clients of llvm-pretty-bc-parser, however, this is wasteful, as these clients typically only need a fraction of the entities inside the file. Nevertheless, every client must currently pay the cost of both loading the entire file, which can be a time-consuming process, and keeping the loaded contents around in memory, which can use quite a bit of space. This is especially painful for large bitcode files, which are common in C++ programs and in large applications.

Fortunately, we can do better here by deferring some of the parsing work until a client actually needs it. The LLVM Bitcode File Format page has this to say about blocks in a bitstream:

Block definitions allow the reader to efficiently skip blocks in constant time if the reader wants a summary of blocks, or if it wants to efficiently skip data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.

I propose that we do just that: defer the parsing of function bodies until a client actually requests it. I haven't precisely worked out the finer details of implementing this, but as with most things LLVM-related, the best place to learn is from the upstream source code for the bitcode reader. In broad terms, we will want to preprocess the bitcode file to construct a mapping from function names to the blocks containing their bodies. Later, when a client requests a particular function, if that function's body has not been parsed, we will do so on demand.

Some currently unresolved design questions:

  • What should the API for this look like? The rough idea I had in my head is something like this:

    parseBitCodeIncremental :: ByteString -> IO (Either Error BitCodeHandle)

    Where BitCodeHandle is an abstract data type representing an opened, bitcode file in the middle of being incrementally parsed. We would need to figure out how many operations on top of BitCodeHandle to provide, e.g.:

    lookupFunction :: Symbol -> BitCodeHandle -> IO (Either Error Define)
  • Does it suffice to only parse function bodies incrementally, or do we need to handle other things incrementally as well?

  • Should an API for incremental parsing replace the current API for eager parsing? Should they live alongside each other? If the latter, should the API for eager parsing be reimplemented in terms of the API for incremental parsing?

See also #176, which discusses other possible performance improvements.

@langston-barrett
Copy link
Contributor

See also #133

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants