Document (and design) the binary index format #156

valencik · 2023-12-10T15:35:23Z

Currently we use a binary format that was created ad hoc by throwing things at scodec until it worked.
Admittedly, I think it's awesome that it works at all, and scodec is very fun to use.

We should actually design a binary format and try to stick to it.

Pros:

Enable other "client" implementations
- perhaps pure handcrafted JS if we're worried about bundle size
- or pure wasm for bundle size + performance
Help ensure compatibility
- future clients should be able to read old indexes, since library documentation indexes will be written with whatever version of protosearch was current at the time

Cons:

I have no idea what I'm doing. Designing a binary format seems hard.
Why are we using binary at all? Why not gzip some JSON?
- Using JSON means we get to leverage existing JSON tools, encoders/decoders, jq for inspecting the file
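For comparison, the gzip-plus-JSON alternative could look roughly like the sketch below. The toy index shape (`terms`, `docs`) and the function names are made up for illustration and are not protosearch's actual data model.

```python
import gzip
import json

# Hypothetical toy index: term -> sorted list of document IDs.
index = {
    "terms": {"binary": [0, 2], "format": [0, 1, 2], "index": [1, 2]},
    "docs": ["doc-a", "doc-b", "doc-c"],
}


def encode_index(idx: dict) -> bytes:
    """Serialize to compact JSON, then gzip the UTF-8 bytes."""
    text = json.dumps(idx, separators=(",", ":"), sort_keys=True)
    return gzip.compress(text.encode("utf-8"))


def decode_index(data: bytes) -> dict:
    """Inverse of encode_index: gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(data).decode("utf-8"))


blob = encode_index(index)
restored = decode_index(blob)
```

The trade-off is that a reader has to decompress and parse the whole file before it can answer anything, but the decompressed file stays inspectable with ordinary JSON tools such as jq.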


valencik · 2023-12-10T15:39:08Z

Why Binary?

it's what Lucene does
- I've long wanted an FST for the terms list, and Lucene encodes this into a byte array, so I've always assumed we'd need to support binary
we can likely save more space
we can likely get more performance by enabling readers to jump to various byte offsets in the file depending on what they need
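As a rough illustration of where the space savings might come from, here is a minimal sketch of variable-byte integers applied to DocID deltas. It assumes a LEB128-style varint (7 data bits per byte, low-order group first) and a sorted list of non-negative DocIDs. The function names are hypothetical, and this is not the encoding protosearch currently uses.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low bits first.

    The high bit of a byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_doc_ids(doc_ids: list[int]) -> bytes:
    """Write the count, then each DocID as a delta from the previous one.

    Unsorted input produces a negative delta and raises ValueError.
    """
    out = bytearray(encode_varint(len(doc_ids)))
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_doc_ids(data: bytes, pos: int = 0) -> tuple[list[int], int]:
    """Inverse of encode_doc_ids; return (DocIDs, next position)."""
    count, pos = decode_varint(data, pos)
    doc_ids = []
    prev = 0
    for _ in range(count):
        delta, pos = decode_varint(data, pos)
        prev += delta
        doc_ids.append(prev)
    return doc_ids, pos
```

With this scheme the list `[3, 130, 131, 100000]` takes 7 bytes including the count, against 16 bytes as four fixed 32-bit integers. Returning the next position from each decoder is what lets a reader keep walking the byte array without copying it.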

Design Notes

include some magic bytes at the front to identify the file type
include index file format version
- so we can evolve the format without breaking things
include a metadata section recording the version of protosearch that wrote the index
- for better debugging
what to do about compression (gzip, zstd, etc)?
- the whole index should be optionally compressed with gzip, zstd, or whatever compression algorithm the user desires
- certain data structures will already be "compressed" in the sense that we may use tricks like variable byte integer encodings and storing DocID deltas rather than absolute values in a list of DocIDs
- it's possible that we want stored fields to be compressed even inside the index file
  - probably the main goal here is to still enable jumplists over the stored fields
a chunked/block structure
- each block indicates its type and length
- allows extending an index with additional data that readers could optionally ignore
- this probably makes the most sense for stored fields which could be quite long
- unsure how this works with the compression point above
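To make the notes above concrete, here is one possible layout as a minimal sketch: magic bytes, a format version, the writer version as metadata, then a sequence of typed, length-prefixed blocks. Everything specific in it is an assumption for illustration: the `PSIX` magic, the field widths, big-endian byte order, and the block type numbers are not part of any existing protosearch format.

```python
import struct

MAGIC = b"PSIX"      # hypothetical magic bytes, not a protosearch constant
FORMAT_VERSION = 1   # hypothetical index file format version

# Hypothetical block type tags.
BLOCK_TERMS = 1
BLOCK_STORED_FIELDS = 2


def write_index(writer_version: str, blocks: list[tuple[int, bytes]]) -> bytes:
    """Layout: magic | u16 format version | u16 length + UTF-8 writer version | blocks.

    Each block is: u8 type | u32 payload length | payload. All big-endian.
    """
    meta = writer_version.encode("utf-8")
    out = bytearray(MAGIC)
    out += struct.pack(">HH", FORMAT_VERSION, len(meta))
    out += meta
    for block_type, payload in blocks:
        out += struct.pack(">BI", block_type, len(payload))
        out += payload
    return bytes(out)


def read_index(data: bytes, wanted: set[int]) -> tuple[str, dict[int, bytes]]:
    """Return the writer version and the payloads of the wanted block types.

    Blocks of any other type are skipped using their declared length.
    """
    if data[:4] != MAGIC:
        raise ValueError("not an index file: bad magic bytes")
    version, meta_len = struct.unpack_from(">HH", data, 4)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    pos = 8
    writer_version = data[pos:pos + meta_len].decode("utf-8")
    pos += meta_len
    found = {}
    while pos < len(data):
        block_type, length = struct.unpack_from(">BI", data, pos)
        pos += 5
        if block_type in wanted:
            found[block_type] = data[pos:pos + length]
        pos += length  # jump over the payload without decoding it
    return writer_version, found
```

A reader built this way ignores block types it does not know, which is what would let a newer writer add data without breaking older clients. The compression question is left open here. One option, again only an assumption, would be a per-block flag so that stored fields can be compressed independently of the rest of the file.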

Some resources:

valencik added the "index" label (Related to indexing, info in the index) Dec 10, 2023

valencik self-assigned this Mar 24, 2024
