Commit

Merge pull request #12 from NowanIlfideme/docs/add-customization-docs

Docs: General doc update. Add customization docs

NowanIlfideme authored Apr 26, 2023
2 parents b79de6a + 7c85e5e commit 7ff43af
Showing 7 changed files with 221 additions and 71 deletions.
1 change: 1 addition & 0 deletions .markdownlint.yaml
@@ -1,2 +1,3 @@
# Configuration for markdownlint, specifically markdownlint-cli2
MD052: false # allow things like [PydanticJsonDataSet][pydantic_kedro.PydanticJsonDataSet]
MD013: false # line length
36 changes: 4 additions & 32 deletions README.md
@@ -7,9 +7,10 @@ via [Kedro](https://kedro.readthedocs.io/en/stable/index.html) and
This package implements custom Kedro "datasets" for both "pure" and "arbitrary"
Pydantic models.

## Examples
Please see the [docs](https://pydantic-kedro.rtfd.io) for a tutorial and
more examples.

### "Pure" Pydantic Models
## Minimal Example

This example works for "pure", JSON-safe Pydantic models via
`PydanticJsonDataSet`:
@@ -21,6 +22,7 @@ from pydantic_kedro import PydanticJsonDataSet

class MyPureModel(BaseModel):
"""Your custom Pydantic model with JSON-safe fields."""

x: int
y: str

@@ -35,33 +37,3 @@ ds.save(obj)
read_obj = ds.load()
assert read_obj.x == 1
```

Note that specifying custom JSON encoders also will work.

### Models with Arbitrary Types

Pydantic [supports models with arbitrary types](https://docs.pydantic.dev/usage/types/#arbitrary-types-allowed)
if you specify it in the model's config.
You can't save/load these via JSON, but you can use the other dataset types,
`PydanticFolderDataSet` and
`PydanticZipDataSet`:

```python
from pydantic import BaseModel
from pydantic_kedro import PydanticJsonDataSet

# TODO

class MyArbitraryModel(BaseModel):
"""Your custom Pydantic model with JSON-unsafe fields."""
x: int
y: str

# TODO
```

## Further Reading

See the [configuration](docs/configuration.md)...

Check out the [API Reference](docs/reference/index.md) for auto-generated docs.
140 changes: 140 additions & 0 deletions docs/arbitrary_types.md
@@ -0,0 +1,140 @@
# Serializing Models with Arbitrary Types

Pydantic [supports models with arbitrary types](https://docs.pydantic.dev/usage/types/#arbitrary-types-allowed)
if you specify it in the model's config.
You can't save/load these via JSON, but you can use the other dataset types:
[PydanticFolderDataSet][pydantic_kedro.PydanticFolderDataSet] and
[PydanticZipDataSet][pydantic_kedro.PydanticZipDataSet].

## Usage Example

```python
from tempfile import TemporaryDirectory
from pydantic import BaseModel
from pydantic_kedro import PydanticZipDataSet


class Foo(object):
"""My custom class. NOTE: this is not a Pydantic model!"""

def __init__(self, foo):
self.foo = foo


class MyArbitraryModel(BaseModel):
"""Your custom Pydantic model with JSON-unsafe fields."""

x: int
foo: Foo

class Config:
"""Configuration for Pydantic V1."""
# Let's pretend it would be difficult to add a json encoder for Foo
arbitrary_types_allowed = True


obj = MyArbitraryModel(x=1, foo=Foo("foofoo"))

# This object is not JSON-serializable
try:
obj.json()
except TypeError as err:
print(err) # Object of type 'Foo' is not JSON serializable

# We can, however, save it with a pydantic-kedro dataset:
with TemporaryDirectory() as tmpdir:
# Create an on-disk (temporary) file via `fsspec` and save it
ds = PydanticZipDataSet(f"{tmpdir}/arb.zip")
ds.save(obj)

# We can re-load it from the same file
read_obj = ds.load()
assert read_obj.foo.foo == "foofoo"
```

> Note: The above model definition can use [`ArbModel`][pydantic_kedro.ArbModel]
> to save keystrokes:
>
> ```python
> from pydantic_kedro import ArbModel
>
> class MyArbitraryModel(ArbModel):
> """Your custom Pydantic model with JSON-unsafe fields."""
>
> x: int
> foo: Foo
> ```
>
> We will use `ArbModel` below, as it also gives type hints for the configuration.

## Default Behavior for Unknown Types

The above code gives the following warning:

```text
UserWarning: No dataset defined for __main__.Foo in `Config.kedro_map`;
using `Config.kedro_default`:
<class 'kedro.extras.datasets.pickle.pickle_dataset.PickleDataSet'>
```

This is because `pydantic-kedro` doesn't know how to serialize the object.
The default is Kedro's `PickleDataSet`, which will generally work only if the same
Python version and libraries are installed on the client that reads the dataset.
## Defining Datasets for Types

To let `pydantic-kedro` know how to serialize a class, you need to add it to the
`kedro_map` model config.
Here's an example for [pandas](https://pandas.pydata.org/) and Pydantic V1:

```python
import pandas as pd
from kedro.extras.datasets.pandas import ParquetDataSet
from pydantic import validator
from pydantic_kedro import ArbModel, PydanticZipDataSet


class MyPandasModel(ArbModel):
    """Model that saves a dataframe, along with some other data."""

    class Config:
        kedro_map = {pd.DataFrame: ParquetDataSet}

    val: int
    df: pd.DataFrame

    @validator('df')
    def _check_dataframe(cls, v: pd.DataFrame) -> pd.DataFrame:
        """Ensure the dataframe is valid (here: non-empty)."""
        assert len(v) > 0
        return v


dfx = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])

m1 = MyPandasModel(df=dfx, val=1)
ds = PydanticZipDataSet("memory://my_model.zip")
ds.save(m1)

m2 = ds.load()
assert m2.df.equals(dfx)
```
Internally, this uses the `ParquetDataSet` to save the dataframe as an
[Apache Parquet](https://parquet.apache.org/) file within the Zip file,
and references it from within the JSON file. This means that, unlike
Pickle, the file isn't "fragile" and will remain readable by future versions.

## Known Issues

1. Currently, the `Config` is not magically inherited by subclasses.
That means that you should explicitly inherit `YourType.Config` from `YourType`'s
base class if you want to override it. It also means that the `kedro_map`
isn't merged for subclasses; you'll need to do this explicitly for now.
2. Only the top-level model's `Config` is taken into account when serializing
to a Kedro dataset, ignoring any children's configs.
This means that all values of a particular type are serialized the same way.
3. `pydantic` V2 is not supported yet, but V2
[has a different configuration method](https://docs.pydantic.dev/blog/pydantic-v2-alpha/#changes-to-config).
`pydantic-kedro` might change the configuration method entirely to be more compliant.
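The first issue can be illustrated with plain (non-Pydantic) classes; the string values below are stand-ins for real dataset classes:

```python
class Base:
    class Config:
        kedro_map = {"DataFrame": "ParquetDataSet"}  # stand-in values


class Child(Base):
    class Config:  # re-declaring Config hides Base.Config entirely
        other_option = True


class ChildFixed(Base):
    class Config(Base.Config):  # explicit inheritance keeps kedro_map
        other_option = True


assert not hasattr(Child.Config, "kedro_map")
assert ChildFixed.Config.kedro_map == {"DataFrame": "ParquetDataSet"}
```

The same applies to Pydantic model `Config` classes: a subclass that declares its own `Config` must explicitly inherit (and, if needed, re-merge) the parent's `kedro_map`.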
3 changes: 0 additions & 3 deletions docs/configuration.md

This file was deleted.

53 changes: 53 additions & 0 deletions docs/implementation_details.md
@@ -0,0 +1,53 @@
# Implementation Details

## File Formats

### JSON Dataset

The [`PydanticJsonDataSet`][pydantic_kedro.PydanticJsonDataSet] dumps your
model as a self-describing JSON file.

To make the dataset self-describing, we add a `"class"` field to the JSON output, containing your class's full import path.

So if you have a Python class defined in `your_module` called `Foo`, the resulting
JSON file will be:

```jsonc
{
"foo": "bar",
// the other fields from your model...
"class": "your_module.Foo"
}
```
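As an illustration of how such a path can be derived, here is a minimal stdlib sketch (the `class_import_path` helper is hypothetical, not part of `pydantic-kedro`'s API):

```python
class Foo:
    """Stand-in for a user-defined model class."""


def class_import_path(obj: object) -> str:
    """Return the full import path of an object's class, e.g. "your_module.Foo"."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


print(class_import_path(Foo()))  # e.g. "__main__.Foo" when Foo is defined in a top-level script
```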

> Note: All [`json_encoders`](https://docs.pydantic.dev/usage/exporting_models/#json_encoders)
> defined on your model will still be used.

### Folder and Zip Datasets

The [`PydanticZipDataSet`][pydantic_kedro.PydanticZipDataSet] is based on the
[`PydanticFolderDataSet`][pydantic_kedro.PydanticFolderDataSet] and just zips
the folder.

The directory structure is as follows:

```text
save_dir
|- meta.json
|- .field1
|- .field2.0
| etc.
```

The `meta.json` file has 3 main fields:

1. `"model_class"` is the class import path, as in the [JSON dataset][json-dataset].
2. `"model_info"` is the JSON serialization of the model, except that all
   arbitrary-typed values are "encoded" as the placeholder string `"__DATA_PLACEHOLDER__"`.
3. `"catalog"` is the pseudo-definition of the Kedro catalog;
   it differs from a regular catalog definition in the `relative_path` argument.

The rest of the files/folders are the relative paths specified in the `catalog`.
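For intuition, a hypothetical `meta.json` for a model with a plain `val` field and an arbitrary-typed `df` field might look roughly like this (layout inferred from the description above, not an exact dump):

```jsonc
{
  "model_class": "your_module.MyPandasModel",               // class import path
  "model_info": { "val": 1, "df": "__DATA_PLACEHOLDER__" }, // placeholder for the arbitrary field
  "catalog": { /* pseudo-definition with `relative_path` entries */ }
}
```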

TODO: Is that all? Do we add `model_schema` or something similar?
This is subject to change as `pydantic-kedro` matures.
54 changes: 20 additions & 34 deletions docs/index.md
@@ -4,15 +4,25 @@ Advanced serialization for [Pydantic](https://docs.pydantic.dev/) models
via [Kedro](https://kedro.readthedocs.io/en/stable/index.html) and
[fsspec](https://filesystem-spec.readthedocs.io/en/latest/).

This package implements custom Kedro "datasets" for both "pure" and "arbitrary"
Pydantic models.
This package implements custom Kedro DataSet types for not only "pure" (JSON-serializable)
Pydantic models, but also models with [`arbitrary_types_allowed`](https://docs.pydantic.dev/usage/types/#arbitrary-types-allowed).

## Examples
Keep reading for a basic tutorial,
or check out the [API Reference](reference/index.md) for auto-generated docs.

### "Pure" Pydantic Models
## Pre-requisites

This example works for "pure", JSON-safe Pydantic models via
[PydanticJsonDataSet][pydantic_kedro.PydanticJsonDataSet]:
To simplify the documentation, we will refer to JSON-serializable Pydantic models
as "pure" models, while all others will be "arbitrary" models.

We also assume you are familiar with [Kedro's Data Catalog](https://docs.kedro.org/en/stable/data/data_catalog.html)
and [Datasets](https://docs.kedro.org/en/stable/data/kedro_io.html).

## "Pure" Pydantic Models

If you have a JSON-safe Pydantic model, you can use a
[PydanticJsonDataSet][pydantic_kedro.PydanticJsonDataSet]
to save your model to any `fsspec`-supported location:

```python
from pydantic import BaseModel
@@ -21,6 +31,7 @@ from pydantic_kedro import PydanticJsonDataSet

class MyPureModel(BaseModel):
"""Your custom Pydantic model with JSON-safe fields."""

x: int
y: str

@@ -36,32 +47,7 @@ read_obj = ds.load()
assert read_obj.x == 1
```

Note that specifying custom JSON encoders also will work.

### Models with Arbitrary Types

Pydantic [supports models with arbitrary types](https://docs.pydantic.dev/usage/types/#arbitrary-types-allowed)
if you specify it in the model's config.
You can't save/load these via JSON, but you can use the other dataset types,
[PydanticFolderDataSet][pydantic_kedro.PydanticFolderDataSet] and
[PydanticZipDataSet][pydantic_kedro.PydanticZipDataSet]:

```python
from pydantic import BaseModel
from pydantic_kedro import PydanticJsonDataSet

# TODO

class MyArbitraryModel(BaseModel):
"""Your custom Pydantic model with JSON-unsafe fields."""
x: int
y: str

# TODO
```

## Further Reading

See the [configuration](configuration.md)...
Note that specifying [custom JSON encoders](https://docs.pydantic.dev/usage/exporting_models/#json_encoders) will work as usual.

Check out the [API Reference](reference/index.md) for auto-generated docs.
However, if your custom type is difficult or impossible to encode/decode via
JSON, read on to [Arbitrary Types](./arbitrary_types.md).
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -11,8 +11,9 @@ use_directory_urls: false

nav:
- Overview: index.md
- Configuration: configuration.md
- Reference: reference/index.md
- Arbitrary Types: arbitrary_types.md
- API Reference: reference/index.md
- Implementation Details: implementation_details.md

theme:
name: "material"

0 comments on commit 7ff43af
