[FEA] Add Parquet and ORC unit tests based on Apache sample files #13627

Open
GregoryKimball opened this issue Jun 27, 2023 · 0 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS tests Unit testing for project

GregoryKimball commented Jun 27, 2023

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:

  • Rare failure with page size estimator (PQ writer, Report, Fix)
  • Failure with >1GB tables (PQ writer, Report, Fix)
  • Failure with 10k nulls followed by >5 valid values (ORC Writer, Report, Fix)

After discussion with the team, we agreed on these additions to our testing suite to help prevent similar issues in the future:

  • Based on test files in parquet-testing/data, verify that "read" versus "read-write-read" result in identical tables
  • Based on test files in orc/examples, verify that "read" versus "read-write-read" result in identical tables
  • Based on test files in parquet-testing/data, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
  • Based on test files in orc/examples, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
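As a rough illustration of the four checks above, here is a minimal, format-agnostic Python sketch. It is not the libcudf test implementation (which would presumably be written in C++ alongside the existing tests). The names `roundtrip_matches`, `readers_agree`, `check_directory`, `read_json`, `write_json`, and `lossy_write_json` are all hypothetical, and the JSON helpers are stand-ins used only so the sketch runs without a GPU. In practice the `read` and `write` callables would wrap the reader and writer under test (for example `cudf.read_parquet` and `DataFrame.to_parquet`), and `equal` would be a proper table-equality helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def _default_equal(a: Any, b: Any) -> bool:
    # Assumption: tables support ==; real tests would use a table-equality helper.
    return a == b


def roundtrip_matches(
    path: Path,
    read: Callable[[Path], Any],
    write: Callable[[Any, Path], None],
    equal: Callable[[Any, Any], bool] = _default_equal,
) -> bool:
    """Return True if "read" and "read-write-read" give equal tables."""
    path = Path(path)
    expected = read(path)  # "read"
    with tempfile.TemporaryDirectory() as tmp:
        rewritten = Path(tmp) / path.name
        write(expected, rewritten)  # "write" with the writer under test
        actual = read(rewritten)  # second "read"
    return equal(expected, actual)


def readers_agree(
    path: Path,
    read_a: Callable[[Path], Any],
    read_b: Callable[[Path], Any],
    equal: Callable[[Any, Any], bool] = _default_equal,
) -> bool:
    """Return True if two independent readers produce equal tables.

    read_b is expected to include any conversion step, e.g. a function that
    reads with Arrow and then converts the result to the first reader's type.
    """
    path = Path(path)
    return equal(read_a(path), read_b(path))


def check_directory(
    data_dir: Path,
    pattern: str,
    read: Callable[[Path], Any],
    write: Callable[[Any, Path], None],
    equal: Callable[[Any, Any], bool] = _default_equal,
) -> Dict[str, bool]:
    """Run the round-trip check on every file in data_dir matching pattern."""
    return {
        p.name: roundtrip_matches(p, read, write, equal)
        for p in sorted(Path(data_dir).glob(pattern))
    }


# Stand-in reader/writer (JSON, column name -> list of values), used only to
# make the sketch runnable; they are not part of any real cuDF API.
def read_json(path: Path) -> Any:
    return json.loads(Path(path).read_text())


def write_json(table: Any, path: Path) -> None:
    Path(path).write_text(json.dumps(table))


def lossy_write_json(table: Any, path: Path) -> None:
    # Deliberately buggy writer that silently drops nulls, to show that the
    # round-trip check catches this kind of corruption.
    cleaned = {k: [v for v in col if v is not None] for k, col in table.items()}
    Path(path).write_text(json.dumps(cleaned))
```

Pointing `check_directory` at a checkout of parquet-testing/data or orc/examples with a matching glob pattern would give a per-file pass/fail map. Files that the reader intentionally rejects would need to be skipped or handled separately.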

Note: please also see #12739. For reader benchmarks, verify that the round-tripped table matches the starting table.

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment tests Unit testing for project libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Jun 27, 2023
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Jun 27, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Jul 5, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023