
[FEA] Add variable bit-width keys and improved key order for Parquet dict pages #13995

Open
abellina opened this issue Aug 29, 2023 · 10 comments
Labels: 0 - Backlog, cuIO, feature request, libcudf, Spark

Comments

@abellina
Contributor

abellina commented Aug 29, 2023

Original title:
"[FEA] research enabling BIT_PACKED encoding for columns in parquet writer"

We have a user report of larger Parquet file sizes when encoding on the GPU compared to Spark on the CPU. With their sample data, I can reproduce a 30% increase in the GPU file size vs. the CPU. I have been able to produce a single row group and the same number of files in both cases, so the difference is down to the column encodings. The columns are all nullable INT64 columns.

It looks like one of the differences between the two files is that the cuDF-written columns are not using the BIT_PACKED encoding:

CPU (ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY):

col1:         INT64 SNAPPY DO:4 FPO:508942 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:856798 FPO:1365736 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]

GPU (ENC:PLAIN_DICTIONARY,RLE):

col1:         INT64 SNAPPY DO:4 FPO:509620 SZ:924686/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:924690 FPO:1434330 SZ:924742/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]

Discussing this with @nvdbaranec, he suggested that BIT_PACKED could account for the difference. I generated mock data of my own (just sequences of longs) and encoded it with both the CPU and the GPU. I have placed two of the generated files in this zip file:

bit_packed_example.zip

I would appreciate any comments. If you want me to try a small change in cuDF and rebuild/retest, I am happy to do so.

abellina added the feature request, Needs Triage, cuIO, and Spark labels on Aug 29, 2023
@nvdbaranec
Contributor

@vuule

@etseidl
Contributor

etseidl commented Sep 5, 2023

Looking at the files, the BIT_PACKED encoding is a bit of a red herring.

    col1 TV=822216 RL=0 DL=1 DS:  102777 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:7509

The file metadata claims the repetition level data is BIT_PACKED, but the max repetition level is 0, so there is no data to encode. I'm surprised to see BIT_PACKED listed since it's been deprecated for at least a decade now. 😉

Digging deeper, the CPU pages are 7509 bytes for the first 26 pages of the column chunk and 10009 bytes for the remaining pages. It seems that the encoder is using a variable bit width for the dictionary encoding: 3 bits for 20000 values would be 7500 bytes, and 4 bits for 20000 values would be 10000 bytes. (Actually, there are runs of 8 values in the data, so each RLE run stores its key padded to whole bytes; the bump in page size really occurs where the dictionary key width crosses from <= 16 bits to >= 17 bits.) libcudf uses a fixed bit width for the entire column chunk based on the number of dictionary keys present. I think it would be a lot of work to use variable bit widths in cuDF.
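
As a sanity check on that arithmetic, here is a standalone sketch (not cuDF code), assuming the Parquet RLE run layout of a short length header followed by the repeated key padded out to whole bytes:

    #include <cstdint>
    #include <cstdio>

    // Approximate size of an RLE-only encoding: each run is a 1-byte length
    // header (for short runs) plus the repeated key padded to whole bytes.
    std::size_t rle_page_bytes(std::size_t num_values, std::size_t run_length, int key_bit_width)
    {
      std::size_t const num_runs      = num_values / run_length;
      std::size_t const key_bytes     = (key_bit_width + 7) / 8;  // ceil(bits / 8)
      std::size_t const bytes_per_run = 1 + key_bytes;            // header + key
      return num_runs * bytes_per_run;
    }

    int main()
    {
      // 20000 values per page in runs of 8 identical keys, as described above.
      std::printf("<= 16-bit keys: %zu bytes\n", rle_page_bytes(20000, 8, 16));  // 7500
      std::printf(">= 17-bit keys: %zu bytes\n", rle_page_bytes(20000, 8, 17));  // 10000
    }

The remaining ~9 bytes of the observed 7509/10009 page sizes would presumably be per-page overhead such as the definition-level data and the RLE bit-width byte.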

You can try limiting the row group size to 400k rows. That might keep the dictionaries in the <= 16 bit range.

@GregoryKimball
Contributor

Thank you @abellina for sharing the file size difference you observed, and thank you @etseidl for your triage of this issue. Once we finish the work around DELTA decoding and encoding, we can consider the feasibility of variable bit width dictionary encoding.

I'll close this issue for now.

GregoryKimball added the 0 - Backlog and libcudf labels and removed the Needs Triage label on Sep 27, 2023
@GregoryKimball
Contributor

@etseidl thank you for studying this file.

Would you please help me understand your observations a bit better?

I thought that each column chunk has one dictionary page, and this dictionary is used for all of the pages in the column chunk. How could the pages in a column chunk switch between 16-bit and 17-bit dictionaries?

libcudf uses a fixed bit width for the entire column chunk based on the number of dictionary keys present. I think it would be a lot of work to use variable bit widths in cudf.

Would you please share a bit more about this feature idea? How would the encoder choose the bit width for each page?

@etseidl
Contributor

etseidl commented Oct 30, 2023

@GregoryKimball, it has to do with the size of the dictionary at the time the page is encoded. Because parquet-mr is more stream-based, it keeps the current dictionary in RAM and adds keys as it goes. So say the first page has just under 64k distinct entries; in that case the maximum key size will be 16 bits, and the page will be RLE-encoded with that bit width. While encoding the second page, the number of distinct entries exceeds 64k, so the RLE encoder will now use 17 bits. cuDF, on the other hand, computes the dictionary up front and then uses the total number of entries to determine the bit width to use for all pages.

cuDF could modify the dictionary page-encoder (not dictionary-page encoder 😄) to first find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding. I'm not sure what that would do to pre-computed page sizes and how the encoded values are stuffed into the column buffer. This could get expensive to compute, but some users might prefer the smallest file possible and be willing to make that trade.
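
As a rough illustration of that idea (a sketch only, not libcudf's actual encoder; the function name and the assumption that the page's dictionary indices are accessible on the host are both hypothetical):

    #include <algorithm>
    #include <bit>
    #include <cstdint>
    #include <span>

    // Pick the RLE bit width for one data page from the largest dictionary
    // index that actually appears in that page, instead of from the total
    // dictionary size of the whole column chunk.
    int page_key_bit_width(std::span<uint32_t const> page_dict_indices)
    {
      if (page_dict_indices.empty()) { return 1; }
      uint32_t const max_key =
        *std::max_element(page_dict_indices.begin(), page_dict_indices.end());
      // std::bit_width(0) == 0, but the RLE encoder needs at least 1 bit.
      return std::max(1, static_cast<int>(std::bit_width(max_key)));
    }

The page-size estimation pass would then also have to track this per-page maximum key, since the encoded page size could no longer be derived from the chunk-wide dictionary size alone.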

@GregoryKimball
Contributor

GregoryKimball commented Oct 30, 2023

cuDF could modify the dictionary page-encoder (not dictionary-page encoder 😄) to first find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding.

Thank you @etseidl, I appreciate the explanation. If we added this dynamic bit width, we might see smaller file sizes from cuDF than from parquet-mr: parquet-mr's bit width can only go up as it writes more pages, whereas cuDF could change the bit width of each data page up or down based on the largest key value in that page. This also makes me wonder if we could add more tricks with dictionary key order to yield even smaller files.

I'm not sure what that would do to pre-computed page sizes and how the encoded values are stuffed into the column buffer.

Good points. It seems like it would take more work upfront to re-compute the page sizes depending on the max key.

This could get expensive to compute, but some users might prefer the smallest file possible and be willing to make that trade.

Certainly some users put a huge premium on file size. This also comes up a lot with the nvCOMP team where there are often runtime/filesize tradeoffs.

@etseidl
Contributor

etseidl commented Oct 30, 2023

I did a quick test and found that each page winds up with a wide range of keys due to the parallel nature of the dictionary construction. This will need more thought.

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Nov 7, 2023
@GregoryKimball
Contributor

I did a quick test and found that each page winds up with a wide range of keys due to the parallel nature of the dictionary construction. This will need more thought.

Thank you @etseidl for testing this. Do you think we could do better by sorting the keys in descending order of their number of occurrences in the column chunk (and then using a dynamic bit width for the pages)?

If sorting by number of occurrences doesn't work well, then I suppose we would be stuck with a more difficult optimization, such as filling the first 2^16 keys so that they cover the largest number of pages.

Does this sound right to you?
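
For illustration, a host-side sketch of the kind of frequency-based remapping being discussed (hypothetical names; in libcudf this would presumably be a device-side pass over the chunk's dictionary):

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Remap dictionary keys so the most frequent values get the smallest
    // indices. key_counts[k] is the number of occurrences of old key k in the
    // column chunk; the result maps old key -> new key.
    std::vector<uint32_t> remap_keys_by_frequency(std::vector<uint32_t> const& key_counts)
    {
      std::vector<uint32_t> order(key_counts.size());
      std::iota(order.begin(), order.end(), 0u);
      // Sort old key ids by descending occurrence count.
      std::stable_sort(order.begin(), order.end(),
                       [&](uint32_t a, uint32_t b) { return key_counts[a] > key_counts[b]; });
      std::vector<uint32_t> old_to_new(key_counts.size());
      for (uint32_t new_id = 0; new_id < static_cast<uint32_t>(order.size()); ++new_id) {
        old_to_new[order[new_id]] = new_id;
      }
      return old_to_new;
    }

The dictionary page would then need to be written in the new key order and every data page's indices rewritten through old_to_new, which is extra work on top of the dynamic bit-width change.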

GregoryKimball changed the title from "[FEA] research enabling BIT_PACKED encoding for columns in parquet writer" to "[FEA] Add variable bit-width keys and improved key order for Parquet dict pages" on Dec 6, 2023
@GregoryKimball GregoryKimball moved this from Needs owner to To be revisited in libcudf Feb 20, 2024
@GregoryKimball
Contributor

Looks like this issue has continued interest from the community. It seems like the next steps to consider would be:

  1. Modify the dictionary data-page encoder to find the largest dictionary key for the given page, and use that value to determine how many bits to use when doing the RLE encoding.
  2. Modify the dictionary-page encoder to reorder keys based on their order of appearance or frequency of occurrence.

@GregoryKimball
Contributor

Thanks @abellina for sharing the example files with a +7% file size increase between CPU and GPU writers. Are there any details you can share about the data that led to the reported +30% file size increase?
