Reverse reverse index #2505
Comments
Some notes as I poke around Tantivy.

Storing terms as a fastfield

I thought this might be implemented with fastfields (columnar storage) which "is designed for the fast random access of some document fields given a document id".

Edit: Oh! Perhaps as a minimal first pass, I could just mark my existing text field as a fast field and specify a tokenizer. Then, wrap a column reader like FacetReader does, to allow fetching by a document id.

tantivy/src/fastfield/writer.rs Lines 134 to 146 in 2f5a269

This method would mean my text is getting tokenized twice though, and I'd be storing the whole term(?) rather than just a term ordinal.

#1325, implementation of fastfield for strings might be relevant here. (However, it looks like the codebase has changed a lot since this PR. For example, the postings writer no longer seems to pass an 'unordered_term_id' to the fastfield module.)

Getting the terms per document

The token stream for a document is processed into terms in tantivy/src/postings/postings_writer.rs Lines 138 to 155 in 2f5a269
tantivy/src/postings/postings_writer.rs Lines 201 to 221 in 2f5a269
tantivy/src/postings/recorder.rs Lines 49 to 85 in 2f5a269

Exposing as an option

This could be exposed as another (or a different kind of) tantivy/src/postings/per_field_postings_writer.rs Lines 33 to 46 in 2f5a269

Alternatively, it might be nice to enable this per document. For example, so I can just keep this kinda index for the latest ~20% of documents. In which case, maybe this could be implemented as a new field type.
You could load the document from the docstore and tokenize the text to get the terms

Ah, but I am not storing this (quite large) text field

The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage

Did it used to be shared? I see your contribution here is passing an
Edit: Maybe that's what this is about #1705 (comment)

It can't be shared anymore since a different tokenizer can be defined now
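
To make the last point concrete, here is a toy sketch (plain Rust, not Tantivy's API) of why ordinals from one dictionary cannot be used against the other once the two sides may use different tokenizers. The functions `build_dict`, `whitespace_lowercase`, `raw`, and `translate_ord` are hypothetical names for illustration, and the sorted `Vec<String>` is only a stand-in for a real term dictionary.

```rust
use std::collections::BTreeSet;

/// Builds a sorted, deduplicated term dictionary using the given tokenizer.
fn build_dict<F>(texts: &[&str], tokenize: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String>,
{
    let unique: BTreeSet<String> = texts.iter().flat_map(|&t| tokenize(t)).collect();
    unique.into_iter().collect()
}

/// Stand-in for the tokenizer used by the inverted index.
fn whitespace_lowercase(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Stand-in for a different tokenizer used by the columnar side (whole value as one term).
fn raw(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

/// Translates an ordinal between dictionaries by going through the term itself.
/// Returns None when the term does not exist in the target dictionary.
fn translate_ord(from: &[String], to: &[String], ord: usize) -> Option<usize> {
    to.binary_search(&from[ord]).ok()
}

fn main() {
    let texts = ["Fast Fields", "term"];
    let inverted = build_dict(&texts, whitespace_lowercase); // ["fast", "fields", "term"]
    let columnar = build_dict(&texts, raw); // ["Fast Fields", "term"]

    // "term" has ordinal 1 on the columnar side but ordinal 2 in the inverted index.
    println!("{:?}", translate_ord(&columnar, &inverted, 1)); // Some(2)
    // "Fast Fields" is a single columnar term with no counterpart in the inverted index.
    println!("{:?}", translate_ord(&columnar, &inverted, 0)); // None
}
```

The same term can land on different ordinals, and some terms exist on only one side, so any mapping between the two has to go through the term bytes rather than the ordinal.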
I'm wanting to look up the terms for a given document, that is Document -> Field -> Terms. Something like what term vectors provide in Lucene. (However, I see positions are already stored a little differently in Tantivy.)

The use case is things like analysing the term distributions in a document (for text classification, summarization, highlighting query terms) and copying individual indexed documents to another index.

I'm thinking of this like a HashMap<DocId, Vec<Term>> where Term (somehow) is a reference to the Term in the termdict, and there will be one of these HashMap<,> reverse reverse indexes per reverse index segment, so we (somehow) need to participate in the merge process. I notice that Lucene.NET has an interface 'InvertedDocConsumer', which is how term vectors (and something called 'freqprox') hook into the indexing chain, so maybe that's a place to draw inspiration.

Edit: It looks like Recorder might be the right interface in Tantivy for writing to this new index. For example, TermFrequencyRecorder.

Can you share any initial thoughts on how you might approach this? Even the very first things that come to your mind will likely greatly accelerate me if I am to try to extend Tantivy to support this kind of index.
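
For illustration only, the HashMap<DocId, Vec<Term>> idea could be sketched as follows in plain Rust, with no Tantivy types involved. Everything here is an assumption made for the sketch: `ForwardIndex`, `TermOrd`, `sorted_dict`, and the `merge` signature are hypothetical names, the sorted `Vec<String>` stands in for a segment's term dictionary, and documents are assumed to arrive already tokenized. It shows one dictionary per segment, terms stored as ordinals, and the ordinal remapping a merge would need.

```rust
use std::collections::{BTreeSet, HashMap};

type DocId = u32;
type TermOrd = u32;

/// Hypothetical per-segment forward index: DocId -> sorted term ordinals.
struct ForwardIndex {
    /// Sorted, deduplicated terms; a stand-in for the segment's term dictionary.
    term_dict: Vec<String>,
    terms_per_doc: HashMap<DocId, Vec<TermOrd>>,
}

/// Collects terms into a sorted, deduplicated dictionary.
fn sorted_dict<'a>(terms: impl Iterator<Item = &'a str>) -> Vec<String> {
    terms.collect::<BTreeSet<_>>().into_iter().map(String::from).collect()
}

impl ForwardIndex {
    /// Builds a segment from already-tokenized documents.
    fn build(docs: &[(DocId, Vec<&str>)]) -> Self {
        let term_dict = sorted_dict(docs.iter().flat_map(|(_, toks)| toks.iter().copied()));
        let mut terms_per_doc = HashMap::new();
        for (doc, toks) in docs {
            let mut ords: Vec<TermOrd> = toks
                .iter()
                .map(|&t| term_dict.binary_search_by(|p| p.as_str().cmp(t)).unwrap() as TermOrd)
                .collect();
            ords.sort_unstable();
            ords.dedup();
            terms_per_doc.insert(*doc, ords);
        }
        ForwardIndex { term_dict, terms_per_doc }
    }

    /// Looks up the distinct terms of a document (Document -> Terms).
    fn terms(&self, doc: DocId) -> Vec<&str> {
        match self.terms_per_doc.get(&doc) {
            Some(ords) => ords.iter().map(|&o| self.term_dict[o as usize].as_str()).collect(),
            None => Vec::new(),
        }
    }

    /// Merges two segments. Doc ids of `other` are shifted by `doc_offset`,
    /// and every ordinal is remapped into the merged dictionary.
    fn merge(&self, other: &ForwardIndex, doc_offset: DocId) -> ForwardIndex {
        let term_dict = sorted_dict(
            self.term_dict.iter().chain(other.term_dict.iter()).map(String::as_str),
        );
        let mut terms_per_doc = HashMap::new();
        for (seg, offset) in [(self, 0), (other, doc_offset)] {
            for (&doc, ords) in &seg.terms_per_doc {
                let remapped: Vec<TermOrd> = ords
                    .iter()
                    .map(|&o| term_dict.binary_search(&seg.term_dict[o as usize]).unwrap() as TermOrd)
                    .collect();
                terms_per_doc.insert(doc + offset, remapped);
            }
        }
        ForwardIndex { term_dict, terms_per_doc }
    }
}

fn main() {
    let seg_a = ForwardIndex::build(&[(0, vec!["fast", "index"]), (1, vec!["index", "term", "index"])]);
    let seg_b = ForwardIndex::build(&[(0, vec!["columnar", "term"])]);
    let merged = seg_a.merge(&seg_b, 2);
    println!("{:?}", merged.terms(1)); // ["index", "term"]
    println!("{:?}", merged.terms(2)); // ["columnar", "term"]
}
```

Storing ordinals rather than whole terms keeps each per-document entry small, at the cost of remapping every ordinal whenever segments are merged, which is the part that would have to hook into the merge process.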