[FEATURE] Pydantic backend to Data Validation #61

Closed
wants to merge 4 commits into from

Conversation

gregparkes
Contributor

TL;DR - This PR is derived from issue #58 to automatically support data validation using Pydantic, a JSON and JSONschema-friendly validation library.

At this point, the PR only defines the schema and basic validations - I have not supplied any means to integrate it into the current library, so all existing behaviour with SigMFFile remains.

Changes

This PR adds a number of files within the component directory (renamed?), the main one being the pydantic_metadata.py script, which contains a Pydantic definition derived from the JSON Schema specified in the main SigMF repository.

The pydantic_metadata.py script defines the SigMF Metadata Standard which includes:

  • SigMFGlobalInfo - global_info
  • SigMFCapture - a single SigMF capture
  • SigMFAnnotation - a single SigMF annotation
  • SigMFMetaFileSchema - a single metadata file (in .sigmf-meta format) containing global, list of captures and list of annotations

Features

To the best of my ability, these classes mirror the defined JSONschema standard and go above and beyond in many ways, including the following features:

  1. core:datatype, version and DOI strings utilise regex patterns to ensure compliance (see pydantic_types.py).
  2. core:version (GlobalInfo), core:uuid (Annotation) and core:datetime (Capture) use default factories to fill automatically upon creation if not defined prior (auto-filling timestamps, version numbers etc)
  3. core:collection, core:dataset and core:license use pathlib.Path and Pydantic HttpUrl objects, which supply extra functionality (path handling, URL parsing) when instantiated.
  4. Index attributes (such as core:sample_start) are validated as non-negative or positive integers.
  5. Validation for mutual exclusivity between core:dataset and core:metadata_only.
  6. Captures and Annotations are automatically sorted by their respective core:sample_start.
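
As a rough illustration of how several of the features listed above could be expressed, here is a minimal sketch assuming Pydantic v2. The class names, the datatype pattern, and the `DEFAULT_VERSION` constant are hypothetical stand-ins and are not taken from pydantic_metadata.py; the SigMF specification remains the authority for the real rules.

```python
from typing import List, Optional
from uuid import uuid4

from pydantic import BaseModel, Field, ValidationError, model_validator

# Illustrative pattern only (assumption); the real rule lives in the SigMF spec.
DATATYPE_PATTERN = r"^(c|r)(f32|f64|i32|i16|u32|u16|i8|u8)(_le|_be)?$"
# Hypothetical default version used purely for this sketch.
DEFAULT_VERSION = "1.0.0"


class GlobalInfoSketch(BaseModel):
    # Feature 1: regex-constrained string.
    datatype: str = Field(alias="core:datatype", pattern=DATATYPE_PATTERN)
    # Feature 2: value filled in automatically when absent.
    version: str = Field(default_factory=lambda: DEFAULT_VERSION, alias="core:version")
    dataset: Optional[str] = Field(default=None, alias="core:dataset")
    metadata_only: Optional[bool] = Field(default=None, alias="core:metadata_only")

    # Feature 5: core:dataset and core:metadata_only are mutually exclusive.
    @model_validator(mode="after")
    def check_exclusive(self):
        if self.dataset is not None and self.metadata_only:
            raise ValueError("core:dataset and core:metadata_only are mutually exclusive")
        return self


class CaptureSketch(BaseModel):
    # Feature 4: index attributes must be non-negative.
    sample_start: int = Field(alias="core:sample_start", ge=0)


class AnnotationSketch(BaseModel):
    sample_start: int = Field(alias="core:sample_start", ge=0)
    uuid: str = Field(default_factory=lambda: str(uuid4()), alias="core:uuid")


class MetaFileSketch(BaseModel):
    global_info: GlobalInfoSketch = Field(alias="global")
    captures: List[CaptureSketch] = Field(default_factory=list)
    annotations: List[AnnotationSketch] = Field(default_factory=list)

    # Feature 6: keep both arrays ordered by core:sample_start.
    @model_validator(mode="after")
    def sort_by_sample_start(self):
        self.captures.sort(key=lambda c: c.sample_start)
        self.annotations.sort(key=lambda a: a.sample_start)
        return self


meta = MetaFileSketch.model_validate(
    {
        "global": {"core:datatype": "cf32_le"},
        "captures": [{"core:sample_start": 100}, {"core:sample_start": 0}],
        "annotations": [{"core:sample_start": 5}],
    }
)
print([c.sample_start for c in meta.captures])  # captures come back sorted
```

Invalid input (a negative core:sample_start, a malformed datatype string, or both core:dataset and core:metadata_only set) would raise a `ValidationError` in this sketch.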

How to use

Creating an object

I've added a helper method SigMFMetaFileSchema.from_file() which takes a .sigmf-meta file path and returns the Pydantic object for it.

Using the object

All of the attributes are reachable by name, e.g. core:version becomes obj.global_info.version.
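
A minimal sketch of what loading a file and reading an attribute might look like, assuming Pydantic v2. The classes below are simplified, hypothetical stand-ins for the PR's SigMFMetaFileSchema, and the body of `from_file()` is a guess at one reasonable implementation rather than the PR's actual code.

```python
import json
import tempfile
from pathlib import Path
from typing import Union

from pydantic import BaseModel, Field


class GlobalInfoSketch(BaseModel):
    datatype: str = Field(alias="core:datatype")
    version: str = Field(alias="core:version")


class MetaFileSketch(BaseModel):
    global_info: GlobalInfoSketch = Field(alias="global")

    @classmethod
    def from_file(cls, path: Union[str, Path]) -> "MetaFileSketch":
        """Read a .sigmf-meta file and validate its JSON contents."""
        return cls.model_validate_json(Path(path).read_text(encoding="utf-8"))


# Write a tiny metadata file to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "example.sigmf-meta"
    payload = {"global": {"core:datatype": "cf32_le", "core:version": "1.0.0"}}
    meta_path.write_text(json.dumps(payload), encoding="utf-8")
    obj = MetaFileSketch.from_file(meta_path)

# The JSON key "core:version" is reachable as a plain attribute.
print(obj.global_info.version)
```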

Exporting an object

Once a SigMFMetaFileSchema object is created, it can be exported to a dictionary with .model_dump(), or to a JSON string (for storage in a file or transmission over the network) with .model_dump_json(by_alias=True, exclude_none=True). Setting by_alias=True is important so that the exported keys use their SigMF names (e.g. core:version) rather than the Python attribute names, and exclude_none=True omits optional fields that were never set.
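
To make the effect of the two export flags concrete, here is a small sketch assuming Pydantic v2. `GlobalInfoSketch` is a hypothetical, cut-down model and not the PR's SigMFGlobalInfo.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class GlobalInfoSketch(BaseModel):
    datatype: str = Field(alias="core:datatype")
    author: Optional[str] = Field(default=None, alias="core:author")


info = GlobalInfoSketch.model_validate({"core:datatype": "cf32_le"})

# Default dump: Python attribute names as keys, unset optionals kept as None.
plain = info.model_dump()

# Export for a .sigmf-meta file: "core:" keys, unset optionals dropped.
exported = json.loads(info.model_dump_json(by_alias=True, exclude_none=True))

print(plain)     # {'datatype': 'cf32_le', 'author': None}
print(exported)  # {'core:datatype': 'cf32_le'}
```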

Accessing the schema

The JSON schema of the SigMFMetaFileSchema can be accessed using .model_json_schema(), allowing you to integrate with any legacy code using the schema.
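
As a hedged illustration of what the generated schema contains, the sketch below (Pydantic v2, with a hypothetical two-field model and an illustrative datatype pattern) shows that aliases and field constraints carry over into the JSON Schema output.

```python
from pydantic import BaseModel, Field

# Illustrative pattern only (assumption); see the SigMF spec for the real rule.
DATATYPE_PATTERN = r"^(c|r)(f32|f64|i32|i16|u32|u16|i8|u8)(_le|_be)?$"


class GlobalInfoSketch(BaseModel):
    datatype: str = Field(alias="core:datatype", pattern=DATATYPE_PATTERN)
    version: str = Field(default="1.0.0", alias="core:version")


schema = GlobalInfoSketch.model_json_schema()

# Properties are keyed by alias, and only fields without defaults are required.
print(sorted(schema["properties"]))
print(schema["required"])
print(schema["properties"]["core:datatype"]["pattern"])
```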

Testing

I've supplied some unit tests that seem to cover the basic cases, although a few extra real examples would be pretty handy, and I haven't yet properly checked how the outputs compare to the current outputs from SigMFFile.

Current code coverage results (pytest --cov=sigmf && coverage report):

Name                                    Stmts  Miss  Branch  BrPart  Cover
sigmf/component/__init__.py                 1     0       0       0   100%
sigmf/component/extensions/__init__.py      1     0       0       0   100%
sigmf/component/extensions/core.py          8     0       0       0   100%
sigmf/component/geo_json.py                31     0       8       0   100%
sigmf/component/pydantic_metadata.py      110     0      24       2    99%
sigmf/component/pydantic_types.py           7     0       0       0   100%

The pipeline I've been using runs in a Python 3.7 environment in Anaconda:

  • black sigmf/component
  • ruff check sigmf/component --fix
  • pylint sigmf/component (gets a 9.95 out of 10 score)
  • mypy -m sigmf raises no errors in my code

Next steps

At the moment there is no code for manipulating the Pydantic objects (aside from creation) to keep controller functionality separate from the 'data' component.

However, supplying code to convert these objects into nested dictionaries or write them to file should be trivial.

Integration

I'm seeking guidance and ideas on how to integrate this into the existing sigmf-python classes.

I would suggest introducing this as an optional backend in the next version, with it becoming the default option at the next release version.

Something like adding a backend=pydantic parameter to the sigmf.sigmffile.fromfile method or similar.
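
One way such a switch could be wired up is sketched below. Everything here is hypothetical: this `fromfile` is a stand-in and not the real sigmf.sigmffile.fromfile, and the two loader functions simply return labelled strings in place of real objects so the dispatch logic can be shown on its own.

```python
from typing import Callable, Dict


def _load_legacy(path: str) -> str:
    # Stand-in for the existing SigMFFile-based loading path.
    return f"legacy:{path}"


def _load_pydantic(path: str) -> str:
    # Stand-in for something like SigMFMetaFileSchema.from_file(path).
    return f"pydantic:{path}"


# Registry of available backends; the keys are illustrative.
_BACKENDS: Dict[str, Callable[[str], str]] = {
    "legacy": _load_legacy,
    "pydantic": _load_pydantic,
}


def fromfile(path: str, backend: str = "legacy") -> str:
    """Dispatch to the chosen backend; existing behaviour stays the default."""
    try:
        loader = _BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(_BACKENDS)}"
        ) from None
    return loader(path)


print(fromfile("capture.sigmf-meta"))
print(fromfile("capture.sigmf-meta", backend="pydantic"))
```

Keeping "legacy" as the default means existing callers are unaffected, and changing the default in a later release becomes a one-line edit.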

Also happy for any changes to names / suggestions to file or internal objects.

SigMF Collections

I've begun an implementation of the SigMF collection standard, but I'm less familiar with this object, so I need to play around with it some more.

@777arc
Member

777arc commented Jun 20, 2024

Was pydantic_metadata.py entirely auto generated off the json schema or were there any manual tweaks that needed to be made?

@gregparkes
Contributor Author

> Was pydantic_metadata.py entirely auto generated off the json schema or were there any manual tweaks that needed to be made?

Unfortunately, a decent number of manual tweaks were needed. In particular, the autogeneration tool turned every schema variable, e.g. core:generator, into a variable name such as core_generator.

This:

+ Maintains uniqueness of each variable, allows extensions to have the same variable name as a core attribute.
- Makes the variable names longer, which is annoying to write and read.

The tool also generated mostly base Python types (e.g. int, str, float) for each attribute and did not supply any special typing, e.g. regex-compliant strings, positive integers (e.g. core:sample_count) and so on.

The custom validation and serialization code associated with each object is also not generated, as a number of the rules are specified in the SigMF standard documentation found here but not actually implemented in the underlying JSON schema: for example, sorting the captures and annotations arrays by core:sample_start, or ensuring core:freq_upper_edge > core:freq_lower_edge. We solve this in Pydantic by ensuring these arrays are sorted in the validation process.
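
A cross-field rule such as the frequency-edge check cannot be expressed by a single field's type, so it needs a model-level validator. The sketch below assumes Pydantic v2; `AnnotationSketch` is a hypothetical, reduced model and not the PR's SigMFAnnotation.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class AnnotationSketch(BaseModel):
    sample_start: int = Field(alias="core:sample_start", ge=0)
    freq_lower_edge: Optional[float] = Field(default=None, alias="core:freq_lower_edge")
    freq_upper_edge: Optional[float] = Field(default=None, alias="core:freq_upper_edge")

    @model_validator(mode="after")
    def check_freq_edges(self):
        # Only compare when both edges are present.
        lower, upper = self.freq_lower_edge, self.freq_upper_edge
        if lower is not None and upper is not None and upper <= lower:
            raise ValueError(
                "core:freq_upper_edge must be greater than core:freq_lower_edge"
            )
        return self


good = AnnotationSketch.model_validate(
    {"core:sample_start": 0, "core:freq_lower_edge": 1.0e6, "core:freq_upper_edge": 2.0e6}
)

try:
    AnnotationSketch.model_validate(
        {"core:sample_start": 0, "core:freq_lower_edge": 2.0e6, "core:freq_upper_edge": 1.0e6}
    )
    swapped_rejected = False
except ValidationError:
    swapped_rejected = True

print(good.freq_upper_edge > good.freq_lower_edge, swapped_rejected)
```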

@Teque5
Collaborator

Teque5 commented Dec 20, 2024

@gregparkes we appreciate your significant effort implementing pydantic for this module, but after much discussion I believe the changes are currently too onerous to move forward with this PR.

We will reconsider this change after #72 is closed, since that will significantly change the way the schema is handled. The schema will no longer live in this repository; instead, it will be pulled from the main SigMF repository and inserted during packaging.

@Teque5 Teque5 closed this Dec 20, 2024

3 participants