Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the writing and reading of the fulltext index #1699

Merged
merged 31 commits into from
Jan 30, 2025

Conversation

Flixtastic
Copy link
Contributor

@Flixtastic Flixtastic commented Jan 4, 2025

  • Move the code for reading and writing the data structures of the text index to a separate class, making the large IndexImpl class significantly smaller.
  • While doing so, also completely refactor and modernize this code.

Copy link

codecov bot commented Jan 4, 2025

Codecov Report

Attention: Patch coverage is 93.42723% with 14 lines in your changes missing coverage. Please review.

Project coverage is 89.95%. Comparing base (d060444) to head (149ef23).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/index/TextIndexReadWrite.cpp 91.57% 4 Missing and 4 partials ⚠️
src/index/TextIndexReadWrite.h 91.66% 2 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1699      +/-   ##
==========================================
- Coverage   89.95%   89.95%   -0.01%     
==========================================
  Files         393      395       +2     
  Lines       37639    37625      -14     
  Branches     4233     4229       -4     
==========================================
- Hits        33860    33847      -13     
- Misses       2480     2482       +2     
+ Partials     1299     1296       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.

Comment on lines 596 to 601
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>(
tbmd._cl._nofElements, tbmd._cl._startScorelist,
static_cast<size_t>(tbmd._cl._lastByte + 1 -
tbmd._cl._startScorelist),
textIndexFile_, &Id::makeFromInt),
idTable.getColumn(2).begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):

  1. We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the Id::makeFromBlaIndex(BlaIndex::make(...)) calls in the transform and copy calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would theoretically work fine but there is one slight Problem. This problem has do to with simple8b encoding after gap encoding. If we try to gap encode IDs the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are there will be a one in the first 4 bits of the ID. This then becomes a problem in simple8b encoding, since it only works for uint64_t with the first 4 bits being 0.

src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Show resolved Hide resolved
Comment on lines 147 to 150
explicit GapEncode(const TypedVector& vectorToEncode);

void writeToFile(ad_utility::File& out, size_t nofElements,
off_t& currentOffset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think you can get away with a lazy view as the input to the constructor.
  2. Would the code in IndexImpl.cpp become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also maybe part for a separate PR, your work here is valuable, by moving it to a separate file it now has a size where we can see the possible improvements much simpler.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
@Flixtastic Flixtastic requested a review from joka921 January 9, 2025 19:38
@Flixtastic
Copy link
Contributor Author

The 2 SonarQube issues can't be adressed with the current implementation of ranges if I am correct.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small reviews from the diff, I'll have another pass over everything with your // MODIFIED comments.

src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional small comments.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
Comment on lines 71 to 100
size_t nofCodebookBytes;
vector<uint64_t> frequencyEncodedResult;
frequencyEncodedResult.resize(nofElements + 250);
off_t current = from;
size_t ret = textIndexFile.read(&nofCodebookBytes, sizeof(size_t), current);
LOG(TRACE) << "Nof Codebook Bytes: " << nofCodebookBytes << '\n';
AD_CONTRACT_CHECK(sizeof(size_t) == ret);
current += ret;
std::vector<From> codebook;
codebook.resize(nofCodebookBytes / sizeof(From));
ret = textIndexFile.read(codebook.data(), nofCodebookBytes, current);
current += ret;
AD_CONTRACT_CHECK(ret == size_t(nofCodebookBytes));
std::vector<uint64_t> simple8bEncoded;
simple8bEncoded.resize(nofElements);
ret = textIndexFile.read(simple8bEncoded.data(), nofBytes - (current - from),
current);
current += ret;
AD_CONTRACT_CHECK(size_t(current - from) == nofBytes);
LOG(DEBUG) << "Decoding Simple8b code...\n";
ad_utility::Simple8bCode::decode(simple8bEncoded.data(), nofElements,
frequencyEncodedResult.data());
LOG(DEBUG) << "Reverting frequency encoded items to actual IDs...\n";
frequencyEncodedResult.resize(nofElements);
vector<To> result;
result.reserve(frequencyEncodedResult.size());
ql::ranges::for_each(frequencyEncodedResult, [&](const auto& encoded) {
result.push_back(transformer(codebook.at(encoded)));
});
LOG(DEBUG) << "Done reading frequency-encoded list. Size: " << result.size()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of your next PRs you can refactor the reading and writing here, using the functionalities in
util/serialization , because all those manual read and offset += ret calls are very hard to read.

Comment on lines 134 to 139
From previous = 0;
for (size_t i = 0; i < gapEncodedVector.size(); ++i) {
previous += gapEncodedVector[i];
result.push_back(transformer(previous));
}
LOG(DEBUG) << "Done reading gap-encoded list. Size: " << result.size()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it (same for the frequency encoding)
happen, that To and From are the same type, (and the transformation therefore does nothing).
In this case you could simply return the read vector, without the (then redundant) copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current implementation that doesn't happen since the type To is always Id when reading (the only reading happens in IndexImpl.Text.cpp) and the type From differs.

src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
Comment on lines 94 to 102
template <typename T>
void writeVectorAndMoveOffset(const std::vector<T>& vectorToWrite,
size_t nofElements, ad_utility::File& file,
off_t& currentOffset) {
size_t bytes =
textIndexReadWrite::writeList(vectorToWrite, nofElements, file);
currentOffset += bytes;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a separate PR, you can refactor this to use our serialization library, which makes the reading and writing of such types to and from disk much more readable.

…g that can be represented as view. Deleted nofElements as parameter of some functions where it wasn't necessary. Applied other small changes requested in PR
@Flixtastic
Copy link
Contributor Author

I tried several things to fix the error but I can't get it to work. I don't quite understand how to get a funciton to accept all types of ranges that and views to work with them.

@Flixtastic Flixtastic requested a review from joka921 January 21, 2025 11:47
# Conflicts:
#	src/index/IndexImpl.Text.cpp
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

I merged in the current master (easy) and simply removed the viewable_range concept (see my comment there).

As this is getting much more cleaned up, I have found some smaller cleanups to be performed in existing code, most of them should be easy to perform, otherwise this is almost ready to merge.

src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
Flixtastic and others added 2 commits January 22, 2025 23:05
…unnecessary construction of spans. Changed manual iterating to compute adjanced differences to using std::adjacent_difference.
…can accept a pointer to write to. Removed unnecessary function writeList and moved it into writeVectorAndMoveOffset which was renamed to encodeAndWriteSpanAndMoveOffset. Removed the communication comments and added correct comments.
@Flixtastic Flixtastic requested a review from joka921 January 24, 2025 02:36
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final suggestions to make Sonarcloud happy.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
@@ -158,9 +275,15 @@ class FrequencyEncode {
template <typename View>
explicit FrequencyEncode(View&& view);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sonarcloud explains its trouble with this constructor pretty closely, and also suggests a solution:

you have to constrain it via a concept or something similar (see the how tab in the sonarcloud explanation on how to best do this). That is the only way to silence that message (which points out a serious potential issue).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed the probelm I got from trying to fix this in the respective email.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
…cyEncode. Added required to check that View is not of GapEncode or FreqEncode. Added std::forward.
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should now really be the last round, I have given you all the information you should need.

Comment on lines 274 to 275
requires(!std::same_as<FrequencyEncode, std::remove_cvref_t<View>>)
explicit FrequencyEncode(View&& inputView) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the .cpp-File again.

  1. C++20 requires-clauses must be repeatet in the definition, so
    in the .cpp-file you woul also have to write
template <typename View> requires(!std:...) FrequencyEncode(View&&...) {...

2 We are currently backporting QLever to C++ 17, so we use macros that use concepts only in C++17 mode. Please use the CPP_template and CPP_template_def macros, see here for details.
So in the header, you write
CPP_template(typename View)(requires !std::same_as...) exlicit...
And in the .cpp file
CPP_template_def(typename View> (requires ...) FrequencyEncode...

(same for the GapEncode class.

explicit FrequencyEncode(View&& view);
requires(!std::same_as<FrequencyEncode, std::remove_cvref_t<View>>)
explicit FrequencyEncode(View&& inputView) {
auto view = std::forward<View>(inputView);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is absolutely a bad idea, it introduces unneeded copies. Please remove this, thecdecov warnings about no forwarding are fine for now.

Comment on lines 351 to 352
requires(!std::same_as<GapEncode, std::remove_cvref_t<View>>)
explicit GapEncode(View&& inputView) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

explicit GapEncode(View&& view);
requires(!std::same_as<GapEncode, std::remove_cvref_t<View>>)
explicit GapEncode(View&& inputView) {
auto view = std::forward<View>(inputView);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely same comment as abve.

@sparql-conformance
Copy link

@joka921 joka921 changed the title Outsourcing and improving the posting writing of IndexImpl.Text.cpp Refactor the writing and reading of the fulltext index Jan 30, 2025
@joka921 joka921 merged commit 49934c0 into ad-freiburg:master Jan 30, 2025
24 checks passed
@Flixtastic Flixtastic deleted the text-index-compression branch February 2, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants