[FEA] Reduce the occurrence of uncompressed pages in Parquet writer #17313

Open
GregoryKimball opened this issue Nov 13, 2024 · 1 comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

GregoryKimball commented Nov 13, 2024

Is your feature request related to a problem? Please describe.
With SNAPPY compression, the Parquet writer can emit a mix of uncompressed and compressed pages. The writer leaves a page uncompressed when its compression ratio is close to 1, which saves decompression work during file read.

However, the Parquet reader does not currently coalesce IO across compressed and uncompressed pages, which fragments the IO into many smaller reads instead of a single large read. You can see this effect by using a host buffer data source and inspecting the CUDA HW trace.

nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --gpu-metrics-device=4 --output=pq_coalesce --env-var CUDA_VISIBLE_DEVICES=4 ./PARQUET_READER_NVBENCH -d 0 -b 1 --profile -a io_type=HOST_BUFFER -a compression_type=[SNAPPY,NONE] -a run_length=1 -a cardinality=[0,1000000,100000,1000]

SNAPPY
Image

With compression NONE we write all uncompressed pages, and with compression ZSTD we write all compressed pages, so these formats do not show the same non-coalesced IO pattern.

NONE
Image

ZSTD
Image
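
The fragmentation described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the libcudf reader's actual logic: `Page`, `coalesce_reads`, and `split_on_type_change` are hypothetical names, and the model simply assumes that contiguous pages are merged into one read unless the page type changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    offset: int       # byte offset of the page in the file
    size: int         # on-disk size in bytes
    compressed: bool  # whether the page was written compressed


def coalesce_reads(pages, split_on_type_change=True):
    """Merge contiguous pages into (offset, size) read ranges.

    Hypothetical model: with split_on_type_change=True, a read range ends
    whenever the next page switches between compressed and uncompressed.
    """
    reads = []
    prev = None
    for page in sorted(pages, key=lambda p: p.offset):
        contiguous = prev is not None and prev.offset + prev.size == page.offset
        same_type = prev is not None and prev.compressed == page.compressed
        if contiguous and (same_type or not split_on_type_change):
            offset, size = reads[-1]
            reads[-1] = (offset, size + page.size)  # extend the current read
        else:
            reads.append((page.offset, page.size))  # start a new read
        prev = page
    return reads


def make_pages(flags, size=1024):
    """Build back-to-back pages of equal size from a list of compressed flags."""
    return [Page(offset=i * size, size=size, compressed=c) for i, c in enumerate(flags)]


# A SNAPPY-like column chunk with a few uncompressed pages mixed in,
# and a ZSTD/NONE-like chunk where every page has the same type.
mixed = make_pages([True, True, False, True, False, False, True])
uniform = make_pages([True] * 7)

print(len(coalesce_reads(mixed)))    # several small reads
print(len(coalesce_reads(uniform)))  # one large read
```

In this model, three uncompressed pages among seven are enough to turn one read into five, and passing `split_on_type_change=False` restores the single read.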

Describe the solution you'd like
We could start with simple solutions:

  • force the Parquet writer to always write compressed pages when SNAPPY is selected
  • adjust the size threshold when SNAPPY compression is used to reduce the number of uncompressed pages
  • adjust the compression heuristic to avoid switching between compressed and uncompressed pages
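
To make these options concrete, here is a minimal sketch of a per-page decision. It is not the libcudf writer or the nvcomp adapter: `encode_page`, `min_ratio`, and `always_compress` are hypothetical names, the threshold value is arbitrary, and `zlib` stands in for SNAPPY only because it ships with Python.

```python
import os
import zlib


def encode_page(raw, min_ratio=1.1, always_compress=False):
    """Return (payload, is_compressed) for one page.

    Hypothetical heuristic: keep the compressed bytes only when the
    compression ratio reaches min_ratio, unless always_compress is set.
    """
    packed = zlib.compress(raw)  # stand-in codec, not SNAPPY
    ratio = len(raw) / len(packed)
    if always_compress or ratio >= min_ratio:
        return packed, True
    return raw, False


def count_uncompressed(pages, **kwargs):
    """Count how many pages the heuristic would leave uncompressed."""
    return sum(1 for raw in pages if not encode_page(raw, **kwargs)[1])


# Two highly compressible pages and two random (incompressible) pages.
pages = [b"a" * 4096, os.urandom(4096), b"abcd" * 1024, os.urandom(4096)]

print(count_uncompressed(pages))                        # ratio-based heuristic
print(count_uncompressed(pages, always_compress=True))  # forced compression
```

With the ratio check, the random pages stay uncompressed and the chunk ends up mixed; forcing compression or lowering `min_ratio` makes every page compressed, at the cost of storing pages that did not shrink.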

We can also consider some more complex solutions:

  • change the reader to coalesce IO even when there is a mix of compressed and uncompressed pages. This could increase memory footprint, but since the number of uncompressed pages is often small, perhaps the impact will be low.
  • change the reader to coalesce IO even when there is a mix of compressed and uncompressed pages, and then use a DtoD (BatchMemcpy?) to separate the uncompressed and compressed pages
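
The second option can be sketched on the host as a single coalesced read followed by a per-page split. This is only an analogy: `split_coalesced` is a hypothetical helper, and the slice-and-copy in the loop stands in for whatever device-to-device copy the reader would actually use.

```python
def split_coalesced(buffer, pages):
    """Split one coalesced read into compressed and uncompressed pages.

    pages is a list of (offset, size, is_compressed) tuples, with offsets
    relative to the start of buffer. Hypothetical layout for illustration.
    """
    view = memoryview(buffer)
    compressed, uncompressed = [], []
    for offset, size, is_compressed in pages:
        chunk = bytes(view[offset:offset + size])  # stand-in for a DtoD copy
        (compressed if is_compressed else uncompressed).append(chunk)
    return compressed, uncompressed


# One contiguous read covering four pages of mixed type.
buffer = b"CCCC" + b"uu" + b"CCC" + b"uuu"
pages = [(0, 4, True), (4, 2, False), (6, 3, True), (9, 3, False)]

compressed, uncompressed = split_coalesced(buffer, pages)
print(compressed)    # pages that still need decompression
print(uncompressed)  # pages that can be decoded directly
```

The extra copy per page is the cost that the performance notes below weigh against simply decompressing more low-ratio pages.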

Performance considerations

  • forcing SNAPPY compression on every page could result in larger file sizes and longer decompression times, so we should measure both. I expect the impact to be negligible: the file size increase should be small, and the decompression time might stay the same, just with slightly higher warp occupancy.
  • doing an extra DtoD copy might end up slower than just decompressing more low-ratio pages
  • How common are uncompressed page writes in other Parquet writers? TBD

GregoryKimball commented

In discussions on 2025-01-08, we came up with the idea of removing the size check altogether. If we remove the size check, we could also simplify the interface between the writer and the nvcomp adapter code.

We should also check the ORC writer.
