
[BUG] String columns written with fastparquet seem to be read incorrectly via CUDF's Parquet reader #14258

Open
mythrocks opened this issue Oct 5, 2023 · 6 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@mythrocks
Contributor

Description

This was uncovered in Spark tests that compare Parquet read/write compatibility with fastparquet.

The last row of a String column written with fastparquet seems to be interpreted by CUDF as having extra null characters appended at the end.

Repro

I'll spare the Scala/Spark details here. Attached is a zipped Parquet file that seems to be read differently by CUDF.

From NVIDIA/spark-rapids#9387:

GPU COLUMN LENGTH - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a003400, length=20, id=-1} VAL: DeviceMemoryBufferView{address=0x30a001e00, length=64, id=-1}
COLUMN LENGTH - STRING
0 "all" 616c6c
1 "the" 746865
2 "leaves" 6c65617665730000000000000000

It would be good to check with the CUDF native Parquet reader, and compare against the results from parquet-mr.
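For anyone trying to reproduce this without the attached file, a rough Python sketch of the round trip (writing with the fastparquet engine via pandas and reading back with cudf) might look like the following; whether the padding appears can depend on the fastparquet version and the data.

```python
import pandas as pd
import cudf

# Hypothetical stand-in for the attached file: write a small string column
# with the fastparquet engine, then read it back on CPU and GPU.
pd.DataFrame({"s": ["all", "the", "leaves"]}).to_parquet(
    "strings_fp.parquet", engine="fastparquet"
)

cpu = pd.read_parquet("strings_fp.parquet", engine="fastparquet")
gpu = cudf.read_parquet("strings_fp.parquet")

# On an affected cudf build, the last value comes back with trailing
# '\x00' bytes, so the hex representations differ.
print([s.encode().hex() for s in cpu["s"]])
print([s.encode().hex() for s in gpu["s"].to_pandas()])
```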

@mythrocks mythrocks added bug Something isn't working Needs Triage Need team to review and classify labels Oct 5, 2023
@etseidl
Contributor

etseidl commented Oct 7, 2023

Looking at a hex dump of the file, it seems the page data is padded with an extra 8 zero-valued bytes. The string length calculation is likely just using the data length minus 4 * num_values to compute the string buffer length, so the pad bytes are being included in the final string. This is unfortunate, because it means the quick shortcut for getting the string lengths can't be used and we'll instead always have to do an expensive traversal of the plain-encoded string data. :(
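For reference, a rough host-side sketch of the two approaches described above, assuming PLAIN-encoded BYTE_ARRAY page data where each value is a 4-byte little-endian length prefix followed by its bytes (not libcudf's actual implementation):

```python
import struct

def char_bytes_shortcut(page_data: bytes, num_values: int) -> int:
    # Fast path: total string bytes = data length minus the 4-byte length
    # prefix of each value. Any trailing pad bytes get counted as string data.
    return len(page_data) - 4 * num_values

def char_bytes_traversal(page_data: bytes, num_values: int) -> int:
    # Slow path: walk the encoded values, summing each declared length.
    # Pad bytes after the last value are never visited.
    total, pos = 0, 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<i", page_data, pos)
        total += length
        pos += 4 + length
    return total
```

With 8 pad bytes appended to the page, the shortcut overcounts by 8, and those bytes end up attached to the last string.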

TBH I'd consider this a bug on the write side...I'll have to check the spec to see if these padding bytes are forbidden or this is just a gray area.

Looking at fastparquet, it seems the padding was added to get some tests to pass. See this commit (writer.py, lines 452 and 484). I wonder if it's worth bringing up with the fastparquet devs; adding garbage padding to byte-array columns should not be necessary.

The relevant line in the current fastparquet is here

@etseidl
Contributor

etseidl commented Oct 9, 2023

Just FYI, here's a profile showing the impact of having to do the string size calculation the hard way. The profile shows reading 50M lines from a large parquet file containing plain encoded strings. The top profile is traversing the encoded string data, summing string lengths as it goes. Due to the structure of the data, this cannot be parallelized so a single thread per page is doing this operation. The bottom profile uses the page data size from the header to calculate string sizes. The call to gpuComputeStringSizes in the former takes 848ms, nearly doubling the read time. Overall read time for the entire file (200M rows, 8.6GB) goes from 3.5s to 6.9s.

[Screenshot from 2023-10-09: profile comparison of the two string-size calculation paths]

@mythrocks
Contributor Author

> I'll have to check the spec to see if these padding bytes are forbidden or this is just a gray area.

Thank you for the analysis, @etseidl. This has me curious.

The main reason I considered this might be something we should address is that the Spark Parquet reader and the Parquet tools seem to read the file correctly.

@etseidl
Contributor

etseidl commented Oct 10, 2023

> The main reason I considered this might be something we should address is that the Spark Parquet reader and the Parquet tools seem to read the file correctly.

Yep, because they're reading a page at a time in batches, they don't need to worry about exact total sizes; they just read the length of each string as they consume it. Even libcudf as of last year would have read that file OK, because the string reads were done in two passes. Now that we have the single-pass read, we need to rely on accurate metadata to get the string data copied into the correct places in the column buffer.
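To illustrate why per-value readers aren't affected, here is a toy decoder for PLAIN byte-array page data (a sketch, not any reader's actual code): it consumes only what each length prefix declares, so pad bytes after the last value are never read.

```python
import struct

def read_page_values(page_data: bytes, num_values: int):
    # Decode num_values PLAIN-encoded byte arrays; anything left over after
    # the last value (e.g. fastparquet's 8 pad bytes) is simply ignored.
    values, pos = [], 0
    for _ in range(num_values):
        (n,) = struct.unpack_from("<i", page_data, pos)
        values.append(page_data[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return values
```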

As usual, the Parquet spec is silent on this. The closest I could find to an answer is in the section on data pages, where it is stated that the uncompressed_page_size field is the sum of rep level data, def level data, and encoded values. I personally think adding the 8 bytes of padding is kind of hacky, and it's just a happy accident that other readers are ok with that.

Edit: Actually, I missed this: "For data pages, the 3 pieces of information are encoded back to back, after the page header. No padding is allowed in the data page."

Anyway, so as not to kill performance for everyone, would it be acceptable to add an option to do the more expensive string size calculation when necessary?

@GregoryKimball
Contributor

Thank you @etseidl and @mythrocks for studying this anomaly. If we did have a reader option to pre-compute sizes, would Spark-RAPIDS have to always set this option to make sure we are correctly avoiding this behavior in the fastparquet writer?

Do you think there could be any sensible postprocessing options to trim the null characters? TBH I didn't even know null characters could exist.

@etseidl
Contributor

etseidl commented Oct 12, 2023

> If we did have a reader option to pre-compute sizes, would Spark-RAPIDS have to always set this option to make sure we are correctly avoiding this behavior in the fastparquet writer?

Yes... or we could turn it around: make the slow pre-compute the default, and enable the faster version on demand. But if fastparquet fixes their writer (I don't see why they wouldn't; they did not do the padding for V2 pages, for instance), then the cudf reader would be slower by default for no reason. A brittle option would be to check the created_by tag in the metadata and enable the slow reader if fastparquet is detected there.
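For illustration, the created_by heuristic could look something like this using pyarrow's metadata API; the cudf reader option named at the end is made up, just to sketch the idea.

```python
import pyarrow.parquet as pq

path = "strings_fp.parquet"

# Check the footer's created_by field; fastparquet stamps its name there.
created_by = pq.ParquetFile(path).metadata.created_by or ""
needs_exact_sizes = "fastparquet" in created_by.lower()

# "exact_string_sizes" is a hypothetical option name illustrating how a
# caller might switch to the slower, exact string-size computation:
# df = cudf.read_parquet(path, exact_string_sizes=needs_exact_sizes)
```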

> Do you think there could be any sensible postprocessing options to trim the null characters?

Well, you'd wind up with a column buffer with holes in it (e.g. CCCC00CCCCCCC00CCCCCCC). You would have to shift chunks of char data left and update the offsets array to match.
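Just to sketch the bookkeeping such a post-processing step would involve, here is a host-side toy with numpy; a real fix would have to operate on the device buffers, and note the caveat in the comments about legitimate NUL bytes.

```python
import numpy as np

def trim_trailing_nulls(chars: np.ndarray, offsets: np.ndarray):
    # chars: uint8 character buffer; offsets: int32 array of length num_rows + 1.
    # Strip trailing '\0' bytes from each string, compact the buffer left,
    # and rebuild the offsets to match. Caveat: this also strips legitimate
    # trailing NULs, which is part of why trimming isn't really sensible.
    pieces, new_offsets, pos = [], [0], 0
    for i in range(len(offsets) - 1):
        s = bytes(chars[offsets[i]:offsets[i + 1]]).rstrip(b"\x00")
        pieces.append(s)
        pos += len(s)
        new_offsets.append(pos)
    new_chars = np.frombuffer(b"".join(pieces), dtype=np.uint8)
    return new_chars, np.array(new_offsets, dtype=np.int32)
```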

> TBH I didn't even know null characters could exist.

They're just byte arrays, with an annotation to interpret them as UTF-8 strings, so null chars are just fine. The trouble here is that fastparquet is adding this padding and counting it in the data size, even though it's not strictly data.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@GregoryKimball GregoryKimball moved this to To be revisited in libcudf Nov 9, 2023
@GregoryKimball GregoryKimball moved this from To be revisited to Needs owner in libcudf Nov 9, 2023
@GregoryKimball GregoryKimball moved this from Needs owner to To be revisited in libcudf Feb 20, 2024