Add verification tests #379
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   99.72%   98.47%   -1.26%
==========================================
  Files          26       26
  Lines        1481     1508      +27
==========================================
+ Hits         1477     1485       +8
- Misses          4       23      +19
@delucchi-cmu here are some options. Take a look and let me know what you think. It's obviously not fully integrated yet. I did add some data that will make the tests fail; I'm not sure if you want those extra files or if the tests should somehow use data that's already here. It also probably needs to use your custom file pointers and the user-supplied storage kwargs that you support, but I haven't added them yet. I also haven't done docstrings yet (these probably aren't the functions you actually want anyway).

There are four tests. The one checking the file schemas against _common_metadata is the thing you actually asked for. It feels a little incomplete if the _common_metadata isn't checked against a schema from the user, but if you don't want to rely on user input here I can understand that. The other tests are code that I had handy that's in line with checking individual parquet files. One is row counts, which I know you're taking care of in a different repo, but I think you said that didn't check the individual files? In any case, I'm happy to take those three extra tests out if you just want to focus on the schema right now.

I'm not sure how you want this integrated with the main
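For readers following along, here is a minimal sketch of the kind of schema check described in this comment: each file's schema is compared against a reference schema such as the one in _common_metadata. It is not the code in this PR. The function name `find_schema_mismatches` and the representation of a schema as a plain mapping of column name to type string are assumptions made for illustration; a real check would read the schemas from the parquet footers.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: a schema is a mapping of column name -> type string.
Schema = Dict[str, str]


def find_schema_mismatches(
    reference: Schema, file_schemas: Dict[str, Schema]
) -> List[Tuple[str, str, str]]:
    """Return (file, column, problem) for every deviation from the reference schema."""
    problems = []
    for path, schema in sorted(file_schemas.items()):
        # Columns the reference has but the file lacks.
        for column in sorted(reference.keys() - schema.keys()):
            problems.append((path, column, "missing column"))
        # Columns the file has but the reference does not.
        for column in sorted(schema.keys() - reference.keys()):
            problems.append((path, column, "unexpected column"))
        # Columns present in both must agree on type.
        for column in sorted(reference.keys() & schema.keys()):
            if reference[column] != schema[column]:
                problems.append(
                    (path, column, f"type {schema[column]} != {reference[column]}")
                )
    return problems


# Hypothetical data standing in for _common_metadata and two file footers.
common_metadata = {"ra": "double", "dec": "double", "id": "int64"}
files = {
    "Npix=0.parquet": {"ra": "double", "dec": "double", "id": "int64"},
    "Npix=1.parquet": {"ra": "float", "dec": "double"},
}
mismatches = find_schema_mismatches(common_metadata, files)
for item in mismatches:
    print(item)
```

The same function could be called a second time with a user-supplied schema as the reference and _common_metadata as the only "file", which would cover the gap mentioned above without changing the comparison logic.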
Force-pushed: 029196d to 10e5efd
Force-pushed: 3855e94 to 8d548ef
I made choices about some questions from above and filled out this code a little more. So right now there is a Verifier that can be run like this:
verifier = Verifier.from_args(args)
verifier.test_is_valid_catalog() # run hipscat.io.validation.is_valid_catalog
verifier.test_schemas() # user-provided schema vs _common_metadata, _metadata, and file footers
verifier.test_num_rows() # file footers vs _metadata (per file), and user-provided total
verifier.record_results() # write a verification report
verifier.record_distributions() # calculate distributions (min/max of all fields) and write a file
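As an illustration of what the row-count step compares, the sketch below checks per-file counts from the file footers against the counts recorded in _metadata, and optionally checks the total against a user-provided value. It is a simplified stand-in, not the `Verifier` implementation: the function name `check_num_rows` and the plain dictionaries of counts are assumptions, and reading the counts from the parquet files is left out.

```python
from typing import Dict, List, Optional


def check_num_rows(
    footer_counts: Dict[str, int],
    metadata_counts: Dict[str, int],
    expected_total: Optional[int] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    # Per-file comparison: file footers vs the counts recorded in _metadata.
    for path in sorted(footer_counts.keys() | metadata_counts.keys()):
        in_footer = footer_counts.get(path)  # None if the file is not on disk
        in_metadata = metadata_counts.get(path)  # None if absent from _metadata
        if in_footer != in_metadata:
            failures.append(f"{path}: footer={in_footer}, _metadata={in_metadata}")
    # Optional comparison of the summed footer counts vs a user-provided total.
    total = sum(footer_counts.values())
    if expected_total is not None and total != expected_total:
        failures.append(f"total: files={total}, expected={expected_total}")
    return failures


# Hypothetical counts for two files.
footers = {"a.parquet": 100, "b.parquet": 50}
metadata = {"a.parquet": 100, "b.parquet": 49}

print(check_num_rows(footers, metadata, expected_total=150))
print(check_num_rows(footers, footers, expected_total=151))
```

Returning messages instead of raising lets every failure be collected in one pass, which suits a step that later writes a verification report.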
Force-pushed: 8d548ef to 6fd3d27
Force-pushed: 6fd3d27 to 5ba3016
Force-pushed: 5ba3016 to 857bad9
superseded by #428
Quick-start for anyone wanting to test this code
Code in this PR is in flux. I will update this comment as necessary when I change code. Currently (18 Oct 2024) this works only with HiPSCat (not HATS) catalogs.
Change Description
Closes #118, closes #373, closes #374
Adds the following:
- `Verifier` class
  - `is_valid_catalog`, file sets (_metadata vs files on disk), row counts, row group statistics, schemas
- `VerifierFixture` class and related configs file fixture_defs.yaml. These define the `Verifier` instances to be used in unit tests and their expected outcomes. I set it up this way because the grid of options to be unit tested is large.
- `Verifier` test.

Still to-do:
Code Quality
Project-Specific Pull Request Checklists
New Feature Checklist