
[BUG] String columns written with fastparquet seem to be read incorrectly via CUDF's Parquet reader #14258

Open
mythrocks opened this issue Oct 5, 2023 · 6 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@mythrocks
Contributor

Description

This was uncovered in Spark tests that compare Parquet read/write compatibility with fastparquet.

The last row of a String column written with fastparquet seems to be interpreted by CUDF as having extra null characters appended at the end.

Repro

I'll spare the Scala/Spark details here. Attached is a zipped Parquet file that seems to be read differently by CUDF.

From NVIDIA/spark-rapids#9387:

GPU COLUMN LENGTH - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a003400, length=20, id=-1} VAL: DeviceMemoryBufferView{address=0x30a001e00, length=64, id=-1}
COLUMN LENGTH - STRING
0 "all" 616c6c
1 "the" 746865
2 "leaves" 6c65617665730000000000000000

It would be good to check with the CUDF native Parquet reader, and compare against the results from parquet-mr.
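For anyone trying to reproduce this without the attached file, a rough Python sketch of the round trip (writing with the fastparquet engine via pandas and reading back with cudf) might look like the following; whether the padding appears can depend on the fastparquet version and the data.

```python
import pandas as pd
import cudf

# Hypothetical stand-in for the attached file: write a small string column
# with the fastparquet engine, then read it back on CPU and GPU.
pd.DataFrame({"s": ["all", "the", "leaves"]}).to_parquet(
    "strings_fp.parquet", engine="fastparquet"
)

cpu = pd.read_parquet("strings_fp.parquet", engine="fastparquet")
gpu = cudf.read_parquet("strings_fp.parquet")

# On an affected cudf build, the last value comes back with trailing
# '\x00' bytes, so the hex representations differ.
print([s.encode().hex() for s in cpu["s"]])
print([s.encode().hex() for s in gpu["s"].to_pandas()])
```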

@mythrocks mythrocks added bug Something isn't working Needs Triage Need team to review and classify labels Oct 5, 2023
@etseidl
Contributor

etseidl commented Oct 7, 2023

Looking at a hex dump of the file, it seems the page data is padded with an extra 8 zero-valued bytes. The string length calculation is likely just using the data length minus 4 * num_values to compute the string buffer length, so the pad bytes are being included in the final string. This is unfortunate, because it means the quick shortcut for getting the string lengths can't be used and we'll instead always have to do an expensive traversal of the plain-encoded string data. :(
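For reference, a rough host-side sketch of the two approaches described above, assuming PLAIN-encoded BYTE_ARRAY page data where each value is a 4-byte little-endian length prefix followed by its bytes (not libcudf's actual implementation):

```python
import struct

def char_bytes_shortcut(page_data: bytes, num_values: int) -> int:
    # Fast path: total string bytes = data length minus the 4-byte length
    # prefix of each value. Any trailing pad bytes get counted as string data.
    return len(page_data) - 4 * num_values

def char_bytes_traversal(page_data: bytes, num_values: int) -> int:
    # Slow path: walk the encoded values, summing each declared length.
    # Pad bytes after the last value are never visited.
    total, pos = 0, 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<i", page_data, pos)
        total += length
        pos += 4 + length
    return total
```

With 8 pad bytes appended to the page, the shortcut overcounts by 8, and those bytes end up attached to the last string.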

TBH I'd consider this a bug on the write side...I'll have to check the spec to see if these padding bytes are forbidden or this is just a gray area.

Looking at fastparquet, it seems the padding was added to get some tests to pass. See this commit (writer.py, lines 452 and 484). I wonder if it's worth bringing up with the fastparquet devs; adding garbage padding to byte-array columns should not be necessary.

The relevant line in the current fastparquet is here

@etseidl
Contributor

etseidl commented Oct 9, 2023

Just FYI, here's a profile showing the impact of having to do the string size calculation the hard way. The profile shows reading 50M lines from a large parquet file containing plain encoded strings. The top profile is traversing the encoded string data, summing string lengths as it goes. Due to the structure of the data, this cannot be parallelized so a single thread per page is doing this operation. The bottom profile uses the page data size from the header to calculate string sizes. The call to gpuComputeStringSizes in the former takes 848ms, nearly doubling the read time. Overall read time for the entire file (200M rows, 8.6GB) goes from 3.5s to 6.9s.

[Screenshot from 2023-10-09: profile comparison of the two string-size calculation paths]

@mythrocks
Contributor Author

> I'll have to check the spec to see if these padding bytes are forbidden or this is just a gray area.

Thank you for the analysis, @etseidl. This has me curious.

The main reason I considered this might be something we should address is that the Spark Parquet reader and the Parquet tools seem to read the file correctly.

@etseidl
Contributor

etseidl commented Oct 10, 2023

> The main reason I considered this might be something we should address is that the Spark Parquet reader and the Parquet tools seem to read the file correctly.

Yep, because they're reading a page at a time in batches, they don't need to worry about exact total sizes; they just read the length of each string as they consume it. Even libcudf as of last year would have read that file OK, because the string reads were done in two passes. Now that we have the single-pass read, we need to rely on accurate metadata to get the string data copied into the correct places in the column buffer.
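To illustrate why per-value readers aren't affected, here is a toy decoder for PLAIN byte-array page data (a sketch, not any reader's actual code): it consumes only what each length prefix declares, so pad bytes after the last value are never read.

```python
import struct

def read_page_values(page_data: bytes, num_values: int):
    # Decode num_values PLAIN-encoded byte arrays; anything left over after
    # the last value (e.g. fastparquet's 8 pad bytes) is simply ignored.
    values, pos = [], 0
    for _ in range(num_values):
        (n,) = struct.unpack_from("<i", page_data, pos)
        values.append(page_data[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return values
```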

As usual, the Parquet spec is silent on this. The closest I could find to an answer is in the section on data pages, where it is stated that the uncompressed_page_size field is the sum of rep level data, def level data, and encoded values. I personally think adding the 8 bytes of padding is kind of hacky, and it's just a happy accident that other readers are ok with that.

Edit: Actually, I missed this: "For data pages, the 3 pieces of information are encoded back to back, after the page header. No padding is allowed in the data page."

Anyway, so as not to kill performance for everyone, would it be acceptable to add an option to do the more expensive string size calculation when necessary?

@GregoryKimball
Contributor

Thank you @etseidl and @mythrocks for studying this anomaly. If we did have a reader option to pre-compute sizes, would Spark-RAPIDS have to always set this option to make sure we are correctly avoiding this behavior in the fastparquet writer?

Do you think there could be any sensible postprocessing options to trim the null characters? TBH I didn't even know null characters could exist.

@etseidl
Contributor

etseidl commented Oct 12, 2023

> If we did have a reader option to pre-compute sizes, would Spark-RAPIDS have to always set this option to make sure we are correctly avoiding this behavior in the fastparquet writer?

Yes... or we could turn it around: make the slow pre-compute the default, and enable the faster version on demand. But if fastparquet fixes their writer (I don't see why they wouldn't; they did not do the padding for V2 pages, for instance), then the cudf reader would be slower by default for no reason. A brittle option would be to check the created_by tag in the metadata and enable the slow reader if fastparquet is detected there.
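For illustration, the created_by heuristic could look something like this using pyarrow's metadata API; the cudf reader option named at the end is made up, just to sketch the idea.

```python
import pyarrow.parquet as pq

path = "strings_fp.parquet"

# Check the footer's created_by field; fastparquet stamps its name there.
created_by = pq.ParquetFile(path).metadata.created_by or ""
needs_exact_sizes = "fastparquet" in created_by.lower()

# "exact_string_sizes" is a hypothetical option name illustrating how a
# caller might switch to the slower, exact string-size computation:
# df = cudf.read_parquet(path, exact_string_sizes=needs_exact_sizes)
```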

> Do you think there could be any sensible postprocessing options to trim the null characters?

Well, you'd wind up with a column buffer with holes in it (e.g. CCCC00CCCCCCC00CCCCCCC). You would have to shift chunks of char data left and update the offsets array to match.
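Just to sketch the bookkeeping such a post-processing step would involve, here is a host-side toy with numpy; a real fix would have to operate on the device buffers, and note the caveat in the comments about legitimate NUL bytes.

```python
import numpy as np

def trim_trailing_nulls(chars: np.ndarray, offsets: np.ndarray):
    # chars: uint8 character buffer; offsets: int32 array of length num_rows + 1.
    # Strip trailing '\0' bytes from each string, compact the buffer left,
    # and rebuild the offsets to match. Caveat: this also strips legitimate
    # trailing NULs, which is part of why trimming isn't really sensible.
    pieces, new_offsets, pos = [], [0], 0
    for i in range(len(offsets) - 1):
        s = bytes(chars[offsets[i]:offsets[i + 1]]).rstrip(b"\x00")
        pieces.append(s)
        pos += len(s)
        new_offsets.append(pos)
    new_chars = np.frombuffer(b"".join(pieces), dtype=np.uint8)
    return new_chars, np.array(new_offsets, dtype=np.int32)
```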

> TBH I didn't even know null characters could exist.

They're just byte arrays, with an annotation to interpret them as UTF-8 strings, so null chars are just fine. The trouble here is that fastparquet is adding this padding and counting it in the data size, even though it's not strictly data.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@GregoryKimball GregoryKimball moved this to To be revisited in libcudf Nov 9, 2023
@GregoryKimball GregoryKimball moved this from To be revisited to Needs owner in libcudf Nov 9, 2023
@GregoryKimball GregoryKimball moved this from Needs owner to To be revisited in libcudf Feb 20, 2024