Add verification tests #379

Closed
wants to merge 6 commits

Conversation

@troyraen (Collaborator) commented Aug 14, 2024

Quick-start for anyone wanting to test this code

Code in this PR is in flux; I will update this comment as the code changes. Currently (18 Oct 2024) this works only with HiPSCat (not HATS) catalogs.

import hipscat_import.verification.run_verification as runner
from hipscat_import.verification.arguments import VerificationArguments

# Change this to point at the HiPSCat catalog's root directory.
# If you have this branch checked out and are in the repo root,
# small_sky_object_catalog can be used for a quick test.
input_catalog_path = "tests/hipscat_import/data/small_sky_object_catalog"
# Directory where you want verification reports written
output_path = "verification_output"
# If you have a parquet schema you want to use as "truth", point this at that file
truth_schema = None
# If you know how many rows SHOULD be in the catalog, enter that number here
truth_total_rows = None

args = VerificationArguments(
    output_path=output_path,
    input_catalog_path=input_catalog_path,
    truth_total_rows=truth_total_rows,
    truth_schema=truth_schema,
)

# The next line runs all verification tests and writes a report
verifier = runner.run(args, write_mode="w")
# Look at the results (same content as the written report)
verifier.results_df

# If you prefer to run tests individually, do this:
verifier = runner.Verifier.from_args(args)
verifier.test_file_sets()
verifier.test_is_valid_catalog()
verifier.test_num_rows()
verifier.test_rowgroup_stats()
verifier.test_schemas()

Change Description

  • My PR includes a link to the issue that I am addressing

Closes #118, closes #373, closes #374

Adds the following:

  • Verifier class
    • verification tests: hipscat is_valid_catalog, file sets (_metadata vs files on disk), row counts, row group statistics, schemas (the file-set and row-count checks are sketched just after this list)
    • output files: verifier_results.csv, field_distributions.csv
  • Test data for malformed catalogs
    • datasets: bad_schemas, no_rowgroup_stats, wrong_files_and_rows
    • code used to generate the datasets: generate_malformed_catalogs.py
  • Pytest fixtures
    • VerifierFixture class and the related config file fixture_defs.yaml. These define the Verifier instances used in unit tests and their expected outcomes. I set it up this way because the grid of options to be unit tested is large.
  • Unit tests for each Verifier test.
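
For a rough sense of what the file-set and row-count checks compare, here is a minimal pyarrow sketch (the catalog path reuses the quick-start test data; the layout assumptions and variable names are illustrative, not the PR's actual code):

import pyarrow.parquet as pq
from pathlib import Path

catalog_dir = Path("tests/hipscat_import/data/small_sky_object_catalog")
dataset_meta = pq.read_metadata(catalog_dir / "_metadata")

# File sets: every file referenced in _metadata should exist on disk, and vice versa.
# Each row group's column chunks share a file_path, so column 0 is enough.
files_in_metadata = {
    dataset_meta.row_group(i).column(0).file_path
    for i in range(dataset_meta.num_row_groups)
}
files_on_disk = {
    p.relative_to(catalog_dir).as_posix() for p in catalog_dir.rglob("*.parquet")
}
assert files_in_metadata == files_on_disk

# Row counts: the per-file footer totals should sum to _metadata's total.
footer_total = sum(pq.read_metadata(catalog_dir / f).num_rows for f in files_on_disk)
assert footer_total == dataset_meta.num_rows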

Still to-do:

  • Connect to provenance info
  • Check code style
  • Determine whether there is overlap with existing hipscat validation and/or whether any of this code should be moved to the hipscat repo.
  • Test a large catalog and determine whether the tests would benefit from parallelization with dask.

Code Quality

  • I have read the Contribution Guide and LINCC Frameworks Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)


codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 29.62963% with 19 lines in your changes missing coverage. Please review.

Project coverage is 98.47%. Comparing base (e7f9b4c) to head (029196d).

Files | Patch % | Lines
...rc/hipscat_import/verification/run_verification.py | 29.62% | 19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   99.72%   98.47%   -1.26%     
==========================================
  Files          26       26              
  Lines        1481     1508      +27     
==========================================
+ Hits         1477     1485       +8     
- Misses          4       23      +19     


@troyraen (Collaborator, Author) commented

@delucchi-cmu here are some options. Take a look and let me know what you think. It's obviously not fully integrated yet. I did add some data that will make the tests fail; I'm not sure whether you want those extra files or whether this should somehow use data that's already here. It probably also needs to use your custom file pointers and the user-supplied storage kwargs that you support, but I haven't added those yet. I also haven't written docstrings (these probably aren't the functions you actually want anyway).

There are four tests. The one checking the file schemas against _common_metadata is the thing you actually asked for. It feels a little incomplete if _common_metadata isn't itself checked against a schema from the user, but I can understand not wanting to rely on user input here. The other three tests are code I had handy that fits naturally with checking individual parquet files. One is row counts, which I know you're handling in a different repo, though I think you said that check doesn't look at the individual files? In any case, I'm happy to take those three extra tests out if you want to focus on the schema for now.
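
For concreteness, the core of that schema comparison can be sketched with pyarrow directly (hypothetical paths; not the exact code in this PR):

import pyarrow.parquet as pq

# Hypothetical catalog paths, for illustration only.
truth = pq.read_schema("catalog/_common_metadata")
file_schema = pq.read_schema("catalog/Norder=0/Dir=0/Npix=11.parquet")

# check_metadata=False compares only field names and types,
# ignoring the schemas' key/value metadata.
if not file_schema.equals(truth, check_metadata=False):
    print("schema mismatch")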

I'm not sure how you want this integrated with the main run function. Right now, the functions I wrote return a boolean indicating pass/fail. What do you want to happen on failure? I'm guessing it should not raise an error. Should it just print messages to stdout, or create a report file, or ...?


github-actions bot commented Sep 13, 2024

Before [829fe47] | After [27dad80] | Ratio | Benchmark (Parameter)
failed | failed | n/a | benchmarks.BinningSuite.time_read_histogram


@troyraen changed the title from Raen/verify/files to Add verification tests on Sep 18, 2024
@troyraen self-assigned this on Sep 18, 2024
@troyraen (Collaborator, Author) commented

I made decisions on some of the questions above and filled out the code a little more. Right now there is a Verifier class that handles the tests, and it's called in the run function with:

verifier = Verifier.from_args(args)
verifier.test_is_valid_catalog()  # run hipscat.io.validation.is_valid_catalog
verifier.test_schemas()  # user-provided schema vs _common_metadata, _metadata, and file footers
verifier.test_num_rows()  # file footers vs _metadata (per file), and user-provided total
verifier.record_results()  # write a verification report

verifier.record_distributions()  # calculate distributions (min/max of all fields) and write a file
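
As a hedged sketch of what record_distributions computes (per-field min/max pulled from parquet row-group statistics; the file path is hypothetical and this is not the PR's exact code):

import pyarrow.parquet as pq

md = pq.read_metadata("catalog/Norder=0/Dir=0/Npix=11.parquet")
mins, maxs = {}, {}
for rg in range(md.num_row_groups):
    row_group = md.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        stats = chunk.statistics
        if stats is None or not stats.has_min_max:
            continue  # files like those in the no_rowgroup_stats test dataset land here
        name = chunk.path_in_schema
        mins[name] = stats.min if name not in mins else min(mins[name], stats.min)
        maxs[name] = stats.max if name not in maxs else max(maxs[name], stats.max)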

@troyraen marked this pull request as ready for review on October 16, 2024 at 16:03
@troyraen (Collaborator, Author) commented Nov 5, 2024

superseded by #428

@troyraen closed this on Nov 5, 2024