
[FEA] Add variable bit-width keys and improved key order for Parquet dict pages #13995

Open
abellina opened this issue Aug 29, 2023 · 10 comments
Labels: 0 - Backlog, cuIO, feature request, libcudf, Spark

Comments

@abellina
Contributor

abellina commented Aug 29, 2023

Original title:
"[FEA] research enabling BIT_PACKED encoding for columns in parquet writer"

We have a user report of larger Parquet file sizes when encoding on the GPU compared to Spark on the CPU. With their sample data, I can reproduce a 30% increase in the GPU file size vs. the CPU. I have been able to produce a single row group and the same number of files in both cases, so the difference is down to the column encodings. The columns are all nullable INT64 columns.

It looks like one of the differences between the two files is that the cuDF-written columns are not using the BIT_PACKED encoding:

CPU (ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY):

col1:         INT64 SNAPPY DO:4 FPO:508942 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:856798 FPO:1365736 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]

GPU (ENC:PLAIN_DICTIONARY,RLE):

col1:         INT64 SNAPPY DO:4 FPO:509620 SZ:924686/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:924690 FPO:1434330 SZ:924742/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]

Discussing this with @nvdbaranec, he suggested that BIT_PACKED could account for the difference. I generated mock data of my own (just sequences of longs) and encoded it with both the CPU and the GPU. I have placed two of the generated files in this zip file:

bit_packed_example.zip

I would appreciate any comments. If you want me to try a small change in cuDF and rebuild/retest, I am happy to do so.

abellina added the feature request, Needs Triage, cuIO, and Spark labels on Aug 29, 2023
@nvdbaranec
Contributor

@vuule

@etseidl
Contributor

etseidl commented Sep 5, 2023

Looking at the files, the BIT_PACKED encoding is a bit of a red herring.

    col1 TV=822216 RL=0 DL=1 DS:  102777 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:7509

The file metadata claims the repetition level data is BIT_PACKED, but the max repetition level is 0, so there is no data to encode. I'm surprised to see BIT_PACKED listed since it's been deprecated for at least a decade now. 😉

Digging deeper, the CPU pages are 7509 bytes for the first 26 pages of the column chunk and 10009 bytes for the remaining pages. It seems that the encoder is using a variable bit width for the dictionary encoding: 3 bits for 20000 values would be 7500 bytes, and 4 bits for 20000 values would be 10000 bytes. (Actually, there are runs of 8 values in the data, so each RLE run stores its key padded to whole bytes; the bump in page size really occurs where the dictionary key width crosses from <= 16 bits to >= 17 bits.) libcudf uses a fixed bit width for the entire column chunk based on the number of dictionary keys present. I think it would be a lot of work to use variable bit widths in cuDF.
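
As a sanity check on that arithmetic, here is a standalone sketch (not cuDF code), assuming the Parquet RLE run layout of a short length header followed by the repeated key padded out to whole bytes:

    #include <cstdint>
    #include <cstdio>

    // Approximate size of an RLE-only encoding: each run is a 1-byte length
    // header (for short runs) plus the repeated key padded to whole bytes.
    std::size_t rle_page_bytes(std::size_t num_values, std::size_t run_length, int key_bit_width)
    {
      std::size_t const num_runs      = num_values / run_length;
      std::size_t const key_bytes     = (key_bit_width + 7) / 8;  // ceil(bits / 8)
      std::size_t const bytes_per_run = 1 + key_bytes;            // header + key
      return num_runs * bytes_per_run;
    }

    int main()
    {
      // 20000 values per page in runs of 8 identical keys, as described above.
      std::printf("<= 16-bit keys: %zu bytes\n", rle_page_bytes(20000, 8, 16));  // 7500
      std::printf(">= 17-bit keys: %zu bytes\n", rle_page_bytes(20000, 8, 17));  // 10000
    }

The remaining ~9 bytes of the observed 7509/10009 page sizes would presumably be per-page overhead such as the definition-level data and the RLE bit-width byte.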

You can try limiting the row group size to 400k rows. That might keep the dictionaries in the <= 16 bit range.

@GregoryKimball
Contributor

Thank you @abellina for sharing the file size difference you observed, and thank you @etseidl for your triage of this issue. Once we finish the work around DELTA decoding and encoding, we can consider the feasibility of variable bit width dictionary encoding.

I'll close this issue for now.

GregoryKimball added the 0 - Backlog and libcudf labels and removed the Needs Triage label on Sep 27, 2023
@GregoryKimball
Contributor

@etseidl thank you for studying this file.

Would you please help me understand your observations a bit better?

I thought that each column chunk has one dictionary page, and this dictionary is used for all of the pages in the column chunk. How could the pages in a column chunk switch between 16-bit and 17-bit dictionaries?

libcudf uses a fixed bit width for the entire column chunk based on the number of dictionary keys present. I think it would be a lot of work to use variable bit widths in cudf.

Would you please share a bit more about this feature idea? How would the encoder choose the bit width for each page?

@etseidl
Contributor

etseidl commented Oct 30, 2023

@GregoryKimball, it has to do with the size of the dictionary at the time the page is encoded. Because parquet-mr is more stream-based, it keeps the current dictionary in RAM and adds keys as it goes. So say the first page has just under 64k distinct entries; in that case the maximum key size will be 16 bits, and the page will be RLE-encoded with that bit width. While encoding the second page, the number of distinct entries exceeds 64k, so the RLE encoder will now use 17 bits. cuDF, on the other hand, computes the dictionary up front and then uses the total number of entries to determine the bit width to use for all pages.

cuDF could modify the dictionary page-encoder (not dictionary-page encoder 😄) to first find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding. I'm not sure what that would do to pre-computed page sizes and how the encoded values are stuffed into the column buffer. This could get expensive to compute, but some users might prefer the smallest file possible and be willing to make that trade.
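
As a rough illustration of that idea (a sketch only, not libcudf's actual encoder; the function name and the assumption that the page's dictionary indices are accessible on the host are both hypothetical):

    #include <algorithm>
    #include <bit>
    #include <cstdint>
    #include <span>

    // Pick the RLE bit width for one data page from the largest dictionary
    // index that actually appears in that page, instead of from the total
    // dictionary size of the whole column chunk.
    int page_key_bit_width(std::span<uint32_t const> page_dict_indices)
    {
      if (page_dict_indices.empty()) { return 1; }
      uint32_t const max_key =
        *std::max_element(page_dict_indices.begin(), page_dict_indices.end());
      // std::bit_width(0) == 0, but the RLE encoder needs at least 1 bit.
      return std::max(1, static_cast<int>(std::bit_width(max_key)));
    }

The page-size estimation pass would then also have to track this per-page maximum key, since the encoded page size could no longer be derived from the chunk-wide dictionary size alone.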

@GregoryKimball
Contributor

GregoryKimball commented Oct 30, 2023

cuDF could modify the dictionary page-encoder (not dictionary-page encoder 😄) to first find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding.

Thank you @etseidl, I appreciate the explanation. If we added this dynamic bit width, we might see smaller file sizes from cuDF than from parquet-mr: parquet-mr's bit width can only go up as it writes more pages, whereas cuDF could change the bit width of each data page up or down based on the largest key value in that page. This also makes me wonder if we could add more tricks with dictionary key order to yield even smaller files.

I'm not sure what that would do to pre-computed page sizes and how the encoded values are stuffed into the column buffer.

Good points. It seems like it would take more work upfront to re-compute the page sizes depending on the max key.

This could get expensive to compute, but some users might prefer the smallest file possible and be willing to make that trade.

Certainly some users put a huge premium on file size. This also comes up a lot with the nvCOMP team where there are often runtime/filesize tradeoffs.

@etseidl
Contributor

etseidl commented Oct 30, 2023

I did a quick test and found that each page winds up with a wide range of keys due to the parallel nature of the dictionary construction. This will need more thought.

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Nov 7, 2023
@GregoryKimball
Contributor

I did a quick test and found that each page winds up with a wide range of keys due to the parallel nature of the dictionary construction. This will need more thought.

Thank you @etseidl for testing this. Do you think we could do better by sorting the keys in descending order of their number of occurrences in the column chunk (and then using a dynamic bit width for the pages)?

If sorting by number of occurrences doesn't work well, then I suppose we would be stuck with a more difficult optimization, such as filling the first 2^16 keys so that they cover the largest number of pages.

Does this sound right to you?
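
For illustration, a host-side sketch of the kind of frequency-based remapping being discussed (hypothetical names; in libcudf this would presumably be a device-side pass over the chunk's dictionary):

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Remap dictionary keys so the most frequent values get the smallest
    // indices. key_counts[k] is the number of occurrences of old key k in the
    // column chunk; the result maps old key -> new key.
    std::vector<uint32_t> remap_keys_by_frequency(std::vector<uint32_t> const& key_counts)
    {
      std::vector<uint32_t> order(key_counts.size());
      std::iota(order.begin(), order.end(), 0u);
      // Sort old key ids by descending occurrence count.
      std::stable_sort(order.begin(), order.end(),
                       [&](uint32_t a, uint32_t b) { return key_counts[a] > key_counts[b]; });
      std::vector<uint32_t> old_to_new(key_counts.size());
      for (uint32_t new_id = 0; new_id < static_cast<uint32_t>(order.size()); ++new_id) {
        old_to_new[order[new_id]] = new_id;
      }
      return old_to_new;
    }

The dictionary page would then need to be written in the new key order and every data page's indices rewritten through old_to_new, which is extra work on top of the dynamic bit-width change.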

GregoryKimball changed the title from "[FEA] research enabling BIT_PACKED encoding for columns in parquet writer" to "[FEA] Add variable bit-width keys and improved key order for Parquet dict pages" on Dec 6, 2023
@GregoryKimball GregoryKimball moved this from Needs owner to To be revisited in libcudf Feb 20, 2024
@GregoryKimball
Contributor

Looks like this issue has continued interest from the community. It seems like the next steps to consider would be:

  1. Modify the dictionary data-page encoder to find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding.
  2. Modify the dictionary-page encoder to reorder keys based on their order of appearance or frequency of occurrence.

@GregoryKimball
Contributor

Thanks @abellina for sharing the example files with a +7% file size increase between CPU and GPU writers. Are there any details you can share about the data that led to the reported +30% file size increase?
