Merge pull request #12 from NowanIlfideme/docs/add-customization-docs
Docs: General doc update. Add customization docs
Showing 7 changed files with 221 additions and 71 deletions.
@@ -1,2 +1,3 @@
# Configuration for markdownlint, specifically markdownlint-cli2
MD052: false  # allow things like [PydanticJsonDataSet][pydantic_kedro.PydanticJsonDataSet]
MD013: false  # line length
@@ -0,0 +1,140 @@
# Serializing Models with Arbitrary Types

Pydantic [supports models with arbitrary types](https://docs.pydantic.dev/usage/types/#arbitrary-types-allowed)
if you specify it in the model's config.
You can't save/load these via JSON, but you can use the other dataset types:
[PydanticFolderDataSet][pydantic_kedro.PydanticFolderDataSet] and
[PydanticZipDataSet][pydantic_kedro.PydanticZipDataSet].

## Usage Example

```python
from tempfile import TemporaryDirectory

from pydantic import BaseModel
from pydantic_kedro import PydanticZipDataSet


class Foo(object):
    """My custom class. NOTE: this is not a Pydantic model!"""

    def __init__(self, foo):
        self.foo = foo


class MyArbitraryModel(BaseModel):
    """Your custom Pydantic model with JSON-unsafe fields."""

    x: int
    foo: Foo

    class Config:
        """Configuration for Pydantic V1."""

        # Let's pretend it would be difficult to add a json encoder for Foo
        arbitrary_types_allowed = True


obj = MyArbitraryModel(x=1, foo=Foo("foofoo"))

# This object is not JSON-serializable
try:
    obj.json()
except TypeError as err:
    print(err)  # Object of type 'Foo' is not JSON serializable

# We can, however, save it to a Zip dataset
with TemporaryDirectory() as tmpdir:
    # Create an on-disk (temporary) file via `fsspec` and save it
    ds = PydanticZipDataSet(f"{tmpdir}/arb.zip")
    ds.save(obj)

    # We can re-load it from the same file
    read_obj = ds.load()
    assert read_obj.foo.foo == "foofoo"
```

> Note: The above model definition can use [`ArbModel`][pydantic_kedro.ArbModel]
> to save keystrokes:
>
> ```python
> from pydantic_kedro import ArbModel
>
> class MyArbitraryModel(ArbModel):
>     """Your custom Pydantic model with JSON-unsafe fields."""
>
>     x: int
>     foo: Foo
> ```
>
> The following sections use `ArbModel`, as it also gives type hints for the configuration.

## Default Behavior for Unknown Types

The above code gives the following warning:

```text
UserWarning: No dataset defined for __main__.Foo in `Config.kedro_map`;
  using `Config.kedro_default`:
  <class 'kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet'>
```

This is because `pydantic-kedro` doesn't know how to serialize the object.
The default is Kedro's `PickleDataSet`, which will generally work only if the same
Python version and libraries are installed on the client that reads the dataset.

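
If you would rather find unmapped types early (for example in tests) than silently fall back to Pickle, one option is to escalate the warning to an error with Python's standard `warnings` module. The sketch below assumes only that the fallback is reported as a `UserWarning`, as the message above suggests. `save_with_fallback` is a hypothetical stand-in for a call such as `ds.save(obj)`, not part of `pydantic-kedro`.

```python
import warnings


def save_with_fallback(value):
    """Hypothetical stand-in for a save that falls back to a default dataset."""
    warnings.warn(
        f"No dataset defined for {type(value).__name__}; using the default",
        UserWarning,
    )
    return "pickle"


def strict_save(value):
    """Run the save, but treat any fallback warning as an error."""
    with warnings.catch_warnings():
        # Inside this block, every UserWarning is raised as an exception
        warnings.simplefilter("error", UserWarning)
        return save_with_fallback(value)


try:
    strict_save(object())
    fallback_detected = False
except UserWarning as err:
    fallback_detected = True
    message = str(err)
```

In real code you would wrap the actual save call in the `catch_warnings()` block instead of the stand-in function.
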
## Defining Datasets for Types

To let `pydantic-kedro` know how to serialize a class, you need to add it to the
`kedro_map` model config.
Here's an example for [pandas](https://pandas.pydata.org/) and Pydantic V1:

```python
import pandas as pd
from kedro.extras.datasets.pandas import ParquetDataSet
from pydantic import validator
from pydantic_kedro import ArbModel, PydanticZipDataSet


class MyPandasModel(ArbModel):
    """Model that saves a dataframe, along with some other data."""

    class Config:
        kedro_map = {pd.DataFrame: ParquetDataSet}

    val: int
    df: pd.DataFrame

    @validator("df")
    def _check_dataframe(cls, v: pd.DataFrame) -> pd.DataFrame:
        """Ensure the dataframe is valid."""
        assert len(v) > 0
        return v


dfx = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])
m1 = MyPandasModel(df=dfx, val=1)

ds = PydanticZipDataSet("memory://my_model.zip")
ds.save(m1)

m2 = ds.load()
assert m2.df.equals(dfx)
```

Internally, this uses `ParquetDataSet` to save the dataframe as an
[Apache Parquet](https://parquet.apache.org/) file within the Zip file,
and references it from within the JSON file. Unlike Pickle, this format isn't
"fragile": it doesn't depend on the exact Python version and libraries used to
write it, so it should remain readable with future versions.

## Known Issues

1. Currently, the `Config` is not automatically inherited by subclasses.
   That means that you should explicitly inherit `YourType.Config` from the `Config` of
   `YourType`'s base class if you want to override it. It also means that the `kedro_map`
   isn't merged for subclasses; you'll need to do this explicitly for now.
2. Only the top-level model's `Config` is taken into account when serializing
   to a Kedro dataset, ignoring any children's configs.
   This means that all values of a particular type are serialized the same way.
3. `pydantic` V2 is not supported yet. V2
   [has a different configuration method](https://docs.pydantic.dev/blog/pydantic-v2-alpha/#changes-to-config),
   so `pydantic-kedro` might change its configuration method entirely to be more compliant.
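
The explicit inheritance and merge described in the first known issue can be sketched as follows. To keep the snippet runnable without any dependencies, plain Python classes stand in for `ArbModel` subclasses, and `BinaryDataSet` and `TableDataSet` are hypothetical placeholders for Kedro dataset classes. Only the shape of the nested `Config` classes is the point here.

```python
class BinaryDataSet:
    """Placeholder for a Kedro dataset class (hypothetical)."""


class TableDataSet:
    """Placeholder for another Kedro dataset class (hypothetical)."""


class BaseModelLike:
    """Stand-in for a model that derives from `ArbModel`."""

    class Config:
        arbitrary_types_allowed = True
        kedro_map = {bytes: BinaryDataSet}


class ChildModelLike(BaseModelLike):
    """Subclass that overrides the config and adds one more mapping."""

    # Inherit the parent's Config explicitly so its settings are kept
    class Config(BaseModelLike.Config):
        # Merge explicitly: start from the parent's map, then add new entries
        kedro_map = {**BaseModelLike.Config.kedro_map, dict: TableDataSet}


merged = ChildModelLike.Config.kedro_map
```

With real models, the same pattern would apply to the `Config` classes of your `ArbModel` subclasses: the dictionary unpacking copies the parent's map, so the parent's own `kedro_map` is left unchanged.
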
This file was deleted.
@@ -0,0 +1,53 @@
# Implementation Details

## File Formats

### JSON Dataset

The [`PydanticJsonDataSet`][pydantic_kedro.PydanticJsonDataSet] dumps your
model as a self-describing JSON file.

In order for the dataset to be self-describing, we add the field `"class"` to the serialized model; its value is your class's full import path.

So if you have a Python class defined in `your_module` called `Foo`, the resulting
JSON file will be:

```jsonc
{
    "foo": "bar",
    // the other fields from your model...
    "class": "your_module.Foo"
}
```

> Note: All [`json_encoders`](https://docs.pydantic.dev/usage/exporting_models/#json_encoders)
> defined on your model will still be used.
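
To illustrate why the `"class"` field makes the file self-describing, here is a minimal sketch of how a reader could turn the import path back into a class using only the standard library. This is not `pydantic-kedro`'s actual loader: `resolve_class` and `split_payload` are hypothetical helpers, and `fractions.Fraction` is used only because it is importable everywhere.

```python
import importlib
import json


def resolve_class(import_path: str) -> type:
    """Import `module.ClassName` and return the class object."""
    module_name, _, class_name = import_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def split_payload(raw: str):
    """Separate the `"class"` marker from the remaining model fields."""
    data = json.loads(raw)
    cls = resolve_class(data.pop("class"))
    return cls, data


raw = '{"numerator": 1, "denominator": 3, "class": "fractions.Fraction"}'
cls, fields = split_payload(raw)
value = cls(**fields)  # rebuild the object from the remaining fields
```

This simple version assumes a top-level class in an importable module; nested classes would need extra handling.
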
### Folder and Zip Datasets

The [`PydanticZipDataSet`][pydantic_kedro.PydanticZipDataSet] is based on the
[`PydanticFolderDataSet`][pydantic_kedro.PydanticFolderDataSet] and just zips
the folder.

The directory structure is as follows:

```text
save_dir
|- meta.json
|- .field1
|- .field2.0
|  etc.
```

The `meta.json` file has 3 main fields:

1. `"model_class"` is the class import path, as in the [JSON dataset][json-dataset].
2. `"model_info"` is the JSON serialization of the model, except that all
   values of arbitrary types are "encoded" as the string `"__DATA_PLACEHOLDER__"`.
3. `"catalog"` is the pseudo-definition of the Kedro catalog.
   The difference from a regular catalog is the `relative_path` argument.

The rest of the files/folders are the relative paths specified in the `catalog`.
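
As a rough illustration of how the three fields fit together, the sketch below writes and re-reads a hand-made `meta.json`. The top-level keys and the placeholder string come from the description above, but the layout of each catalog entry (`"type"` and `"relative_path"` keys, and the dataset name used) is an assumption made for this example and may not match the real files.

```python
import json
import tempfile
from pathlib import Path

PLACEHOLDER = "__DATA_PLACEHOLDER__"

# Illustrative only: the catalog entry layout below is an assumption
example_meta = {
    "model_class": "your_module.Foo",
    "model_info": {"x": 1, "foo": PLACEHOLDER},
    "catalog": {
        ".foo": {"type": "pickle.PickleDataSet", "relative_path": ".foo"},
    },
}


def placeholder_fields(meta: dict) -> list:
    """Names of top-level fields whose data lives outside meta.json."""
    return [k for k, v in meta["model_info"].items() if v == PLACEHOLDER]


def catalog_paths(meta: dict) -> list:
    """Relative paths of the files/folders referenced by the catalog."""
    return [entry["relative_path"] for entry in meta["catalog"].values()]


with tempfile.TemporaryDirectory() as save_dir:
    meta_path = Path(save_dir) / "meta.json"
    meta_path.write_text(json.dumps(example_meta, indent=2))
    loaded = json.loads(meta_path.read_text())

external = placeholder_fields(loaded)
paths = catalog_paths(loaded)
```

Here the field `foo` is marked with the placeholder in `model_info`, and the catalog tells the reader which relative path holds its data.
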

TODO: Is that all? Do we add `model_schema` or something similar?
This is subject to change as `pydantic-kedro` gets more mature.