Document (and design) the binary index format #156

Open
valencik opened this issue Dec 10, 2023 · 1 comment

Comments

@valencik
Contributor

Currently we use a binary format that was created ad hoc by just throwing things at scodec until it worked.
Admittedly, I think it's awesome that it works at all, and scodec is very fun to use.

We should actually design a binary format and try to stick to it.

Pros:

  • Enable other "client" implementations
    • perhaps pure handcrafted JS if we're worried about bundle size
    • or pure wasm for bundle size + performance
  • Help ensure compatibility
    • future clients should be able to read old indexes, since library documentation indexes will be written with whatever version of protosearch was current at the time

Cons:

  • I have no idea what I'm doing. Designing a binary format seems hard.
  • Why are we using binary at all? Why not gzip some JSON?
    • Using JSON means we get to leverage existing JSON tools, encoders/decoders, jq for inspecting the file
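
For comparison, the gzip-plus-JSON alternative could look roughly like the sketch below. The index shape (a `terms` map and a `docs` list) is a made-up toy structure for illustration, not the actual protosearch index layout.

```python
import gzip
import json

# Hypothetical toy index: term -> sorted list of DocIDs, plus document names.
index = {
    "terms": {"cat": [0, 2, 5], "dog": [1, 2]},
    "docs": ["a.html", "b.html", "c.html", "d.html", "e.html", "f.html"],
}


def encode_index(idx: dict) -> bytes:
    """Serialize to compact JSON, then gzip the UTF-8 bytes."""
    raw = json.dumps(idx, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return gzip.compress(raw)


def decode_index(blob: bytes) -> dict:
    """Gunzip, then parse the JSON document."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


blob = encode_index(index)
restored = decode_index(blob)
```

A file written this way can be inspected with something like `gunzip -c index.json.gz | jq .`. The trade-off is that a reader has to decompress and parse the whole document before it can answer any query.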
@valencik
Contributor Author

valencik commented Dec 10, 2023

Why Binary?

  • it's what Lucene does
    • I've long wanted an FST for the terms list, and Lucene encodes this into a byte array, so I've always assumed we'd need to support binary
  • we can likely save more space
  • we can likely get more performance by enabling readers to jump to various byte offsets in the file depending on what they need
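
To make the space argument concrete, here is a minimal sketch of one common trick: variable-byte integers combined with delta-encoded DocID lists. The function names and the layout (a count followed by gaps) are assumptions for illustration, not the existing protosearch encoding.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low-order bits first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def encode_postings(doc_ids: list[int]) -> bytes:
    """Write the list length, then the gap between consecutive sorted DocIDs."""
    out = bytearray(encode_varint(len(doc_ids)))
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(buf: bytes, pos: int = 0) -> tuple[list[int], int]:
    """Inverse of encode_postings; return (DocIDs, next position)."""
    count, pos = decode_varint(buf, pos)
    doc_ids = []
    prev = 0
    for _ in range(count):
        gap, pos = decode_varint(buf, pos)
        prev += gap
        doc_ids.append(prev)
    return doc_ids, pos


postings = [1000, 1003, 1010, 1500, 100000]
encoded = encode_postings(postings)
decoded, _ = decode_postings(encoded)
```

For this example list the encoding takes 10 bytes, compared with 20 bytes for five fixed 4-byte integers. The saving depends on the gaps being small, which is why the DocIDs must be sorted before encoding.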

Design Notes

  • include some magic bytes at the front to identify the file type
  • include index file format version
    • so we can evolve the format without breaking things
  • include metadata recording the version of protosearch that wrote the index
    • for better debugging
  • what to do about compression (gzip, zstd, etc)?
    • the whole index should be optionally compressed with gzip, zstd, or whatever compression algorithm the user desires
    • certain data structures will already be "compressed" in the sense that we may use tricks like variable-byte integer encodings and storing the deltas between consecutive DocIDs rather than the DocIDs themselves
    • it's possible that we want stored fields to be compressed even inside the index file
      • probably the main goal here is to still enable jumplists over the stored fields
  • a chunked/block structure
    • each block indicates its type and length
    • allows extending an index with additional data that readers could optionally ignore
    • this probably makes the most sense for stored fields which could be quite long
    • unsure how this works with the compression point above
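
As a rough sketch of how the notes above might fit together: a fixed header with magic bytes and a format version, followed by blocks that each carry a type tag and a length. Everything concrete here is an assumption made for illustration: the `PSIX` magic value, the version number, the block type tags, and the field widths are all hypothetical, not a decided format.

```python
import io
import struct
from typing import Optional

MAGIC = b"PSIX"      # hypothetical magic bytes, not an actual protosearch value
FORMAT_VERSION = 1   # hypothetical index file format version

# Hypothetical block type tags.
BLOCK_METADATA = 1
BLOCK_TERMS = 2
BLOCK_STORED_FIELDS = 3

HEADER = struct.Struct(">4sH")       # magic, big-endian u16 format version
BLOCK_HEADER = struct.Struct(">BI")  # u8 block type, big-endian u32 payload length


def write_index(blocks: list[tuple[int, bytes]]) -> bytes:
    """Write the header, then each block as (type, length, payload)."""
    out = io.BytesIO()
    out.write(HEADER.pack(MAGIC, FORMAT_VERSION))
    for block_type, payload in blocks:
        out.write(BLOCK_HEADER.pack(block_type, len(payload)))
        out.write(payload)
    return out.getvalue()


def read_block(data: bytes, wanted_type: int) -> Optional[bytes]:
    """Return the payload of the first block of wanted_type, or None.

    Blocks of any other type, including unknown ones, are skipped by
    seeking past their payload without reading it.
    """
    stream = io.BytesIO(data)
    header = stream.read(HEADER.size)
    if len(header) != HEADER.size:
        raise ValueError("truncated header")
    magic, version = HEADER.unpack(header)
    if magic != MAGIC:
        raise ValueError("not an index file")
    if version > FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    while True:
        raw = stream.read(BLOCK_HEADER.size)
        if not raw:
            return None  # reached end of file without finding the block
        if len(raw) != BLOCK_HEADER.size:
            raise ValueError("truncated block header")
        block_type, length = BLOCK_HEADER.unpack(raw)
        if block_type == wanted_type:
            payload = stream.read(length)
            if len(payload) != length:
                raise ValueError("truncated block payload")
            return payload
        stream.seek(length, io.SEEK_CUR)  # jump over a block we do not need


data = write_index([
    (BLOCK_METADATA, b'{"writer":"protosearch x.y.z"}'),
    (99, b"a future block type that older readers can ignore"),
    (BLOCK_TERMS, b"terms payload"),
])
terms = read_block(data, BLOCK_TERMS)
```

Because every block states its length, a reader can skip block type 99 without understanding it. One possible answer to the open compression question is to compress individual block payloads, so the length prefix still allows skipping; that is only a suggestion and would need to be weighed against compressing the whole file.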

Some resources:

@valencik valencik added the index Related to indexing, info in the index label Dec 10, 2023
@valencik valencik self-assigned this Mar 24, 2024