Support uncompressed or fully in-memory vocabularies #1740

Open: wants to merge 14 commits into base: master

@joka921 (Member) commented Jan 31, 2025

So far, QLever's vocabulary (the set of distinct IRIs and literals) has been compressed (at the level of individual IRIs and literals) and has resided on disk by default. Certain IRIs or literals could be configured to reside in RAM, but there was some overhead for determining whether a given IRI or literal resides in memory or on disk. With this change, there is a free choice between the following four kinds of vocabularies:

  1. On disk and compressed (the default so far)
  2. On disk and uncompressed (costs more space but saves time for decompression)
  3. Fully in memory and compressed (faster than disk but memory is expensive)
  4. Fully in memory and uncompressed (maximally fast but most memory hungry)
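
The four kinds are the combinations of two independent choices (storage location and compression). A minimal sketch of how such a choice could be modeled follows; the enum, the helper functions, and the configuration strings are hypothetical names chosen for illustration and are not taken from the QLever code base.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum for illustration only; the names used in the actual
// QLever implementation may differ.
enum class VocabularyKind {
  OnDiskCompressed,     // 1. the default so far
  OnDiskUncompressed,   // 2. more space, no decompression cost
  InMemoryCompressed,   // 3. faster than disk, needs RAM
  InMemoryUncompressed  // 4. fastest, most RAM
};

// True if the whole vocabulary is held in RAM.
constexpr bool isInMemory(VocabularyKind kind) {
  return kind == VocabularyKind::InMemoryCompressed ||
         kind == VocabularyKind::InMemoryUncompressed;
}

// True if the individual IRIs and literals are stored compressed.
constexpr bool isCompressed(VocabularyKind kind) {
  return kind == VocabularyKind::OnDiskCompressed ||
         kind == VocabularyKind::InMemoryCompressed;
}

// Map a configuration string (assumed spelling) to a kind.
// Returns std::nullopt for unknown strings.
constexpr std::optional<VocabularyKind> parseVocabularyKind(
    std::string_view name) {
  if (name == "on-disk-compressed") return VocabularyKind::OnDiskCompressed;
  if (name == "on-disk-uncompressed") return VocabularyKind::OnDiskUncompressed;
  if (name == "in-memory-compressed") return VocabularyKind::InMemoryCompressed;
  if (name == "in-memory-uncompressed") {
    return VocabularyKind::InMemoryUncompressed;
  }
  return std::nullopt;
}
```

Because the kind is a single value fixed up front, a lookup would not need a per-entry check of whether an IRI or literal lives in memory or on disk, which is the overhead mentioned above.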

Performance tests for exporting a TSV with 100M triples and a total size of 8.8 GB yielded the following results on a Ryzen 9 9950X with 4 x WD_BLACK SN850X NVMe in RAID0 (column 2) and on a Ryzen 5900X with 1 x TODO NVMe (column 3):

| Vocabulary | Ryzen 9 9950X, 4 x NVMe RAID0 | Ryzen 5900X, 1 x TODO NVMe |
|---|---|---|
| on disk, compressed | 65s | 177s |
| on disk, uncompressed | 58s | 151s |
| in memory, compressed | 51s | 110s |
| in memory, uncompressed | 50s | 109s |

When measuring the performance multiple times in a row, so that the relevant parts of the disk-based vocabulary are cached by the operating system, the measurements from the first two rows decrease slightly (this effect is more pronounced when exporting smaller results).

BOTTOM LINE: An in-memory vocabulary is 40 - 60% faster for a single SSD, but only 5 - 15% faster for four parallel SSDs. Using compression is around 10% slower on disk and comes at almost no cost in memory.
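
The text does not state how "faster" is defined for these percentages. As a sketch, the two common definitions can be computed from the measured times with the helpers below; the function names are hypothetical and only serve to make the arithmetic reproducible.

```cpp
// Hypothetical helpers for comparing two wall-clock times in seconds.

// Fraction of the original time that is saved, e.g. 0.25 means 25% less time.
constexpr double timeSaved(double beforeSeconds, double afterSeconds) {
  return (beforeSeconds - afterSeconds) / beforeSeconds;
}

// Throughput ratio, e.g. 1.5 means 50% more work per unit of time.
constexpr double speedup(double beforeSeconds, double afterSeconds) {
  return beforeSeconds / afterSeconds;
}
```

For example, `speedup(177, 110)` compares the on-disk and in-memory compressed vocabularies on the single-SSD machine, and `timeSaved(65, 58)` gives the relative cost of compression on disk for the four-SSD machine. The two definitions give different percentages for the same pair of times, so it helps to say which one is meant.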

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 99.43182% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (098b79c) to head (d8080b3).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/index/Vocabulary.cpp | 90.90% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1740      +/-   ##
==========================================
+ Coverage   90.01%   90.03%   +0.01%     
==========================================
  Files         395      398       +3     
  Lines       37793    37954     +161     
  Branches     4257     4262       +5     
==========================================
+ Hits        34019    34171     +152     
- Misses       2478     2482       +4     
- Partials     1296     1301       +5     

TODO:
Make the vocabulary implementation selectable from CMake

Signed-off-by: Johannes Kalmbach <[email protected]>
Signed-off-by: Johannes Kalmbach <[email protected]>
@joka921 joka921 marked this pull request as ready for review February 5, 2025 17:54
@hannahbast (Member) left a comment

Quick 1-1 with Johannes, this looks great and it works!

@hannahbast changed the title from "Also support uncompressed or in Memory vocabularies" to "Also support uncompressed or in-memory vocabularies" Feb 5, 2025
@hannahbast changed the title from "Also support uncompressed or in-memory vocabularies" to "Support uncompressed or fully in-memory vocabularies" Feb 6, 2025
Signed-off-by: Johannes Kalmbach <[email protected]>
Signed-off-by: Johannes Kalmbach <[email protected]>

sonarqubecloud bot commented Feb 6, 2025

Quality Gate failed

Failed conditions
E Reliability Rating on New Code (required ≥ A)
