-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the writing and reading of the fulltext index #1699
Refactor the writing and reading of the fulltext index #1699
Conversation
…yet, commit is used to initialize branch
…quency and gap compressed lists.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1699 +/- ##
==========================================
- Coverage 89.95% 89.95% -0.01%
==========================================
Files 393 395 +2
Lines 37639 37625 -14
Branches 4233 4229 -4
==========================================
- Hits 33860 33847 -13
- Misses 2480 2482 +2
+ Partials 1299 1296 -3 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.
src/index/IndexImpl.Text.cpp
Outdated
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>( | ||
tbmd._cl._nofElements, tbmd._cl._startScorelist, | ||
static_cast<size_t>(tbmd._cl._lastByte + 1 - | ||
tbmd._cl._startScorelist), | ||
textIndexFile_, &Id::makeFromInt), | ||
idTable.getColumn(2).begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):
- We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the
Id::makeFromBlaIndex(BlaIndex::make(...))
calls in thetransform
andcopy
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would theoretically work fine but there is one slight Problem. This problem has do to with simple8b encoding after gap encoding. If we try to gap encode IDs the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are there will be a one in the first 4 bits of the ID. This then becomes a problem in simple8b encoding, since it only works for uint64_t with the first 4 bits being 0.
src/index/TextIndexReadWrite.h
Outdated
explicit GapEncode(const TypedVector& vectorToEncode); | ||
|
||
void writeToFile(ad_utility::File& out, size_t nofElements, | ||
off_t& currentOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think you can get away with a lazy view as the input to the constructor.
- Would the code in
IndexImpl.cpp
become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also maybe part for a separate PR, your work here is valuable, by moving it to a separate file it now has a size where we can see the possible improvements much simpler.
…th using vector.data()
The 2 SonarQube issues can't be adressed with the current implementation of ranges if I am correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small reviews from the diff, I'll have another pass over everything with your // MODIFIED
comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional small comments.
src/index/TextIndexReadWrite.h
Outdated
size_t nofCodebookBytes; | ||
vector<uint64_t> frequencyEncodedResult; | ||
frequencyEncodedResult.resize(nofElements + 250); | ||
off_t current = from; | ||
size_t ret = textIndexFile.read(&nofCodebookBytes, sizeof(size_t), current); | ||
LOG(TRACE) << "Nof Codebook Bytes: " << nofCodebookBytes << '\n'; | ||
AD_CONTRACT_CHECK(sizeof(size_t) == ret); | ||
current += ret; | ||
std::vector<From> codebook; | ||
codebook.resize(nofCodebookBytes / sizeof(From)); | ||
ret = textIndexFile.read(codebook.data(), nofCodebookBytes, current); | ||
current += ret; | ||
AD_CONTRACT_CHECK(ret == size_t(nofCodebookBytes)); | ||
std::vector<uint64_t> simple8bEncoded; | ||
simple8bEncoded.resize(nofElements); | ||
ret = textIndexFile.read(simple8bEncoded.data(), nofBytes - (current - from), | ||
current); | ||
current += ret; | ||
AD_CONTRACT_CHECK(size_t(current - from) == nofBytes); | ||
LOG(DEBUG) << "Decoding Simple8b code...\n"; | ||
ad_utility::Simple8bCode::decode(simple8bEncoded.data(), nofElements, | ||
frequencyEncodedResult.data()); | ||
LOG(DEBUG) << "Reverting frequency encoded items to actual IDs...\n"; | ||
frequencyEncodedResult.resize(nofElements); | ||
vector<To> result; | ||
result.reserve(frequencyEncodedResult.size()); | ||
ql::ranges::for_each(frequencyEncodedResult, [&](const auto& encoded) { | ||
result.push_back(transformer(codebook.at(encoded))); | ||
}); | ||
LOG(DEBUG) << "Done reading frequency-encoded list. Size: " << result.size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one of your next PRs you can refactor the reading and writing here, using the functionalities in
util/serialization
, because all those manual read
and offset += ret
calls are very hard to read.
src/index/TextIndexReadWrite.h
Outdated
From previous = 0; | ||
for (size_t i = 0; i < gapEncodedVector.size(); ++i) { | ||
previous += gapEncodedVector[i]; | ||
result.push_back(transformer(previous)); | ||
} | ||
LOG(DEBUG) << "Done reading gap-encoded list. Size: " << result.size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it (same for the frequency encoding)
happen, that To and From are the same type, (and the transformation therefore does nothing).
In this case you could simply return the read vector, without the (then redundant) copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current implementation that doesn't happen since the type To is always Id when reading (the only reading happens in IndexImpl.Text.cpp) and the type From differs.
src/index/TextIndexReadWrite.cpp
Outdated
template <typename T> | ||
void writeVectorAndMoveOffset(const std::vector<T>& vectorToWrite, | ||
size_t nofElements, ad_utility::File& file, | ||
off_t& currentOffset) { | ||
size_t bytes = | ||
textIndexReadWrite::writeList(vectorToWrite, nofElements, file); | ||
currentOffset += bytes; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a separate PR, you can refactor this to use our serialization library, which makes the reading and writing of such types to and from disk much more readable.
…g that can be represented as view. Deleted nofElements as parameter of some functions where it wasn't necessary. Applied other small changes requested in PR
…oded list" This reverts commit 464a6fb.
I tried several things to fix the error but I can't get it to work. I don't quite understand how to get a funciton to accept all types of ranges that and views to work with them. |
# Conflicts: # src/index/IndexImpl.Text.cpp
Signed-off-by: Johannes Kalmbach <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much!
I merged in the current master (easy) and simply removed the viewable_range
concept (see my comment there).
As this is getting much more cleaned up, I have found some smaller cleanups to be performed in existing code, most of them should be easy to perform, otherwise this is almost ready to merge.
…unnecessary construction of spans. Changed manual iterating to compute adjanced differences to using std::adjacent_difference.
…can accept a pointer to write to. Removed unnecessary function writeList and moved it into writeVectorAndMoveOffset which was renamed to encodeAndWriteSpanAndMoveOffset. Removed the communication comments and added correct comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final suggestions to make Sonarcloud happy.
src/index/TextIndexReadWrite.h
Outdated
@@ -158,9 +275,15 @@ class FrequencyEncode { | |||
template <typename View> | |||
explicit FrequencyEncode(View&& view); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sonarcloud explains its trouble with this constructor pretty closely, and also suggests a solution:
you have to constrain it via a concept or something similar (see the how
tab in the sonarcloud explanation on how to best do this). That is the only way to silence that message (which points out a serious potential issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed the probelm I got from trying to fix this in the respective email.
…cyEncode. Added required to check that View is not of GapEncode or FreqEncode. Added std::forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should now really be the last round, I have given you all the information you should need.
src/index/TextIndexReadWrite.h
Outdated
requires(!std::same_as<FrequencyEncode, std::remove_cvref_t<View>>) | ||
explicit FrequencyEncode(View&& inputView) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this to the .cpp-File again.
- C++20 requires-clauses must be repeatet in the definition, so
in the .cpp-file you woul also have to write
template <typename View> requires(!std:...) FrequencyEncode(View&&...) {...
2 We are currently backporting QLever to C++ 17, so we use macros that use concepts only in C++17 mode. Please use the CPP_template
and CPP_template_def
macros, see here for details.
So in the header, you write
CPP_template(typename View)(requires !std::same_as...) exlicit...
And in the .cpp file
CPP_template_def(typename View> (requires ...) FrequencyEncode...
(same for the GapEncode
class.
src/index/TextIndexReadWrite.h
Outdated
explicit FrequencyEncode(View&& view); | ||
requires(!std::same_as<FrequencyEncode, std::remove_cvref_t<View>>) | ||
explicit FrequencyEncode(View&& inputView) { | ||
auto view = std::forward<View>(inputView); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is absolutely a bad idea, it introduces unneeded copies. Please remove this, thecdecov warnings about no forwarding are fine for now.
src/index/TextIndexReadWrite.h
Outdated
requires(!std::same_as<GapEncode, std::remove_cvref_t<View>>) | ||
explicit GapEncode(View&& inputView) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
src/index/TextIndexReadWrite.h
Outdated
explicit GapEncode(View&& view); | ||
requires(!std::same_as<GapEncode, std::remove_cvref_t<View>>) | ||
explicit GapEncode(View&& inputView) { | ||
auto view = std::forward<View>(inputView); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely same comment as abve.
… that used std::forward
Signed-off-by: Johannes Kalmbach <[email protected]>
This reverts commit 255349e.
… to the cpp file. Forwarded the constructor arguments to that method. Explained the reason in the comment above
Conformance check passed ✅No test result changes. |
|
IndexImpl
class significantly smaller.