Support one rust type serializing as many dtypes #15

Open: 20 commits to be merged into base branch master

ExpHP commented Jun 4, 2019

This is now ready for review!

I have chosen to finally submit a PR now that cargo test --all succeeds, and the behavior of all of the existing examples has been verified. Some work still remains to be done.

Overview

  • Adds support for non-little endianness.
  • Adds de/serialization of the following dtypes:
    • Byte strings |Sn and binary blobs |Vn (as Vec<u8>)
    • Datetimes <M8[us] and timedeltas <m8[us] (as u64 and i64)
  • Serializable had to be split into three traits to accommodate these features. More about this later.
  • Adds full parsing and validation of string descrs, hopefully making it easier to add support for complex numbers, bools, np.float16, np.float128, and unicode strings in the future.

This is a large PR. However, as of Thursday, June 6, I have rewritten the commit history so that each commit can be easily reviewed. (Nothing introduced in one commit gets changed in a later one.)

Closes #11.
Closes #12.
Closes #19.


I will add a comment which itemizes all of the changes to the public API and the reasoning behind them.

@ExpHP
Author

ExpHP commented Jun 4, 2019

Summary of all changes

Additions / Big changes

The new serialization traits

Serializable has been split up quite a bit, because not all operations can be done on all types.

/// Trait that permits reading a type from an `.npy` file.
pub trait Deserialize: Sized {
    /// Think of this as a `Fn(&[u8]) -> Self`, with bonuses.
    ///
    /// Unfortunately, until rust supports existential associated types, actual closures
    /// cannot be used here, and you must define something that manually implements [`TypeRead`].
    type Reader: TypeRead<Value=Self>;

    /// Get a function that deserializes a single data field at a time
    ///
    /// The function receives a byte buffer containing at least
    /// `dtype.num_bytes()` bytes.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn reader(dtype: &DType) -> Result<Self::Reader, DTypeError>;
}

/// Trait that permits writing a type to an `.npy` file.
pub trait Serialize {
    /// Think of this as some sort of `for<W: io::Write> Fn(W, &Self) -> io::Result<()>`.
    ///
    /// Unfortunately, rust does not have generic closures, so you must manually define
    /// your own implementor of the [`TypeWrite`] trait.
    type Writer: TypeWrite<Value=Self>;

    /// Get a function that serializes a single data field at a time.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn writer(dtype: &DType) -> Result<Self::Writer, DTypeError>;
}

/// Subtrait of [`Serialize`] for types which have a reasonable default `DType`.
///
/// This opens up some simpler APIs for serialization. (e.g. [`::to_file`])
pub trait AutoSerialize: Serialize {
    /// A suggested format for serialization.
    ///
    /// The builtin implementations for primitive types generally prefer `|` endianness if possible,
    /// else the machine endian format.
    fn default_dtype() -> DType;
}
  • Deserialize is implemented for primitive ints, floats, and Vec<u8>. (but not [u8], which is unsized)
  • Serialize is implemented for primitive ints, floats, Vec<u8>, [u8], and for things behind a variety of pointer types.
  • AutoSerialize is implemented for primitive ints and floats. (but not Vec<u8>/[u8], for which |Sn and |Vn both sound reasonable and both have disadvantages)

Where's n_bytes?

It's on DType now. There is no other way to possibly support e.g. |V42.

impl DType {
    pub fn num_bytes(&self) -> usize;
}

Worth noting is that, unfortunately, this means the compiler can no longer constant-fold the sizes for large records. To mitigate that, we now have...

Two-stage de/serialization

So, what's the deal with the Reader and Writer types? Basically, types now need to validate their DTypes and possibly do different things based on what they contain.

To ensure that this can be done efficiently, de/serialization now takes place in two stages:

  1. DType validation. (and potentially caching useful info like offsets)
  2. The actual reading/writing.

The first step is, of course, done by Serialize::writer and Deserialize::reader already seen above. The second step is done by these methods:

pub trait TypeRead {
    type Value;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

pub trait TypeWrite {
    type Value: ?Sized;

    fn write_one<W: io::Write>(&self, writer: W, value: &Self::Value) -> io::Result<()>
    where Self: Sized;
}

Needless to say, manually implementing these traits has become a fair bit of a chore now. Please see the updated roundtrip example.
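To make the two stages concrete, here is a minimal, self-contained sketch. The `TypeRead` trait has the same shape as the one above, but `Endian`, `SimpleDType`, `I32Reader`, and `i32_reader` are hypothetical stand-ins invented for illustration; they are not the crate's actual `DType` or built-in impls, and the error type is simplified to `String`.

```rust
use std::convert::TryInto;

/// Hypothetical, simplified byte-order marker (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Endian { Little, Big }

/// Hypothetical stand-in for a parsed scalar dtype such as `<i4` or `>i4`.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleDType { pub endian: Endian, pub kind: char, pub size: usize }

/// Same shape as the `TypeRead` trait described in this PR.
pub trait TypeRead {
    type Value;
    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

/// Output of stage 1: caches only what stage 2 needs.
pub struct I32Reader { endian: Endian }

/// Stage 1: validate the dtype once, before any data is touched.
pub fn i32_reader(dtype: &SimpleDType) -> Result<I32Reader, String> {
    match (dtype.kind, dtype.size) {
        ('i', 4) => Ok(I32Reader { endian: dtype.endian }),
        _ => Err(format!("cannot read {:?} as i32", dtype)),
    }
}

/// Stage 2: the per-element work is a single branch on the cached endianness.
impl TypeRead for I32Reader {
    type Value = i32;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (i32, &'a [u8]) {
        // Panics if fewer than 4 bytes are supplied.
        let (head, rest) = bytes.split_at(4);
        let raw: [u8; 4] = head.try_into().expect("split_at(4) yields 4 bytes");
        let value = match self.endian {
            Endian::Little => i32::from_le_bytes(raw),
            Endian::Big => i32::from_be_bytes(raw),
        };
        (value, rest)
    }
}
```

In this sketch an incompatible dtype is rejected once in `i32_reader`, so `read_one` itself never has to return an error.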

The error type exposes a single public constructor for use by manual impls of the traits.

/// Indicates that a particular rust type does not support serialization or deserialization
/// as a given [`DType`].
#[derive(Debug, Clone)]
pub struct DTypeError(ErrorKind);

impl fmt::Display for DTypeError { ... }
impl std::error::Error for DTypeError { ... }

impl DTypeError {
    pub fn custom<S: AsRef<str>>(msg: S) -> Self;
}

One more trait: TypeWriteDyn

There's... this... thing.

pub trait TypeWriteDyn: TypeWrite {
    #[doc(hidden)]
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()>;
}

impl<T: TypeWrite> TypeWriteDyn for T { }

Long story short: dyn TypeWrite can't do anything, so you should use dyn TypeWriteDyn instead.

I'd remove this from the PR if I could to save it for later consideration... but currently some of the built-in impls use it. I need to do some benchmarking first.
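The reason `dyn TypeWrite` is of little use is that `write_one` is generic over the writer and carries `where Self: Sized`, so it cannot be called through a trait object. A minimal sketch of the workaround follows. The two traits match the signatures above; the forwarding body of the blanket impl, the `U8Writer` type, and the `write_all_dyn` helper are assumptions made for illustration.

```rust
use std::io;

pub trait TypeWrite {
    type Value: ?Sized;

    // Generic over `W`, hence unavailable on `dyn TypeWrite`.
    fn write_one<W: io::Write>(&self, writer: W, value: &Self::Value) -> io::Result<()>
    where
        Self: Sized;
}

pub trait TypeWriteDyn: TypeWrite {
    // Not generic, so it can be called on a trait object.
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()>;
}

// Assumed forwarding impl: every sized `TypeWrite` gets the dyn-friendly method.
impl<T: TypeWrite> TypeWriteDyn for T {
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()> {
        // `&mut dyn io::Write` itself implements `io::Write`.
        self.write_one(writer, value)
    }
}

/// Hypothetical writer for single bytes.
pub struct U8Writer;

impl TypeWrite for U8Writer {
    type Value = u8;

    fn write_one<W: io::Write>(&self, mut writer: W, value: &u8) -> io::Result<()> {
        writer.write_all(&[*value])
    }
}

/// Hypothetical helper that only knows about the trait object.
pub fn write_all_dyn(w: &dyn TypeWriteDyn<Value = u8>, values: &[u8]) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    for v in values {
        w.write_one_dyn(&mut buf, v)?;
    }
    Ok(buf)
}
```

Calling `write_all_dyn(&U8Writer, &[1, 2, 3])` goes through dynamic dispatch for both the type writer and the `io::Write`, which is presumably the kind of cost the benchmarking mentioned above would need to measure.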

OutFile::open_with_dtype

The existing methods for writing files require AutoSerialize instead of Serialize. One new method, OutFile::open_with_dtype, was added for types that only implement the latter.

impl<Row: AutoSerialize> OutFile<Row> {
    /// Create a file, using the default format for the given type.
    pub fn open<P: AsRef<Path>>(path: P) -> io::Result<Self>;
}

impl<Row: Serialize> OutFile<Row> {
    pub fn open_with_dtype<P: AsRef<Path>>(dtype: &DType, path: P) -> io::Result<Self>;
}

pub fn to_file<S, T, P>(filename: P, data: T) -> ::std::io::Result<()> where
        P: AsRef<Path>,
        S: AutoSerialize,
        T: IntoIterator<Item=S>;

DType::Plain.ty changed from String -> TypeStr

The new TypeStr type is a fully parsed form of a stringlike descr. This is necessary so that reader and writer can easily match on various properties of the descr without having to look at string text.

/// Represents an Array Interface type-string.
///
/// This is more or less the `DType` of a scalar type.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct TypeStr {
    /* no public fields */
}

impl fmt::Display for TypeStr { ... }
impl str::FromStr for TypeStr {
    type Err = ParseTypeStrError;
    ...
}

Parsing one generally produces validation errors for anything not accepted by the np.dtype function.

It currently exposes no public API for inspection or manipulation.

The error is just a garden-variety error type (Debug, Clone, Display, Error). I purposefully didn't use nom because nom gives terrible error messages.
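For a sense of what a hand-written parser with readable errors can look like, here is a minimal sketch. `MiniTypeStr`, `Endianness`, and `MiniParseError` are hypothetical names, and the accepted grammar (one byte-order character, one kind character out of `i`, `u`, `f`, `S`, `V`, then a decimal size) is a deliberately small assumption; the real `TypeStr` validates far more, including units such as `[us]`.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness { Little, Big, Irrelevant }

/// Hypothetical, much-reduced stand-in for `TypeStr`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MiniTypeStr { pub endianness: Endianness, pub kind: char, pub size: usize }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MiniParseError(String);

impl fmt::Display for MiniParseError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "invalid type string: {}", self.0)
    }
}

impl std::error::Error for MiniParseError {}

impl FromStr for MiniTypeStr {
    type Err = MiniParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut chars = s.chars();
        // First character: byte order.
        let endianness = match chars.next() {
            Some('<') => Endianness::Little,
            Some('>') => Endianness::Big,
            Some('|') => Endianness::Irrelevant,
            other => {
                return Err(MiniParseError(format!("bad byte order {:?} in {:?}", other, s)))
            }
        };
        // Second character: type kind (only a few kinds in this sketch).
        let kind = match chars.next() {
            Some(c) if "iufSV".contains(c) => c,
            other => return Err(MiniParseError(format!("bad kind {:?} in {:?}", other, s))),
        };
        // Remainder: size in bytes, digits only (rejects e.g. "+4").
        let digits = chars.as_str();
        if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
            return Err(MiniParseError(format!("bad size {:?} in {:?}", digits, s)));
        }
        let size = digits
            .parse()
            .map_err(|_| MiniParseError(format!("size out of range in {:?}", s)))?;
        Ok(MiniTypeStr { endianness, kind, size })
    }
}
```

With this, `"|V42".parse::<MiniTypeStr>()` yields a value whose `size` is 42, while `"<x4"` fails with a message naming the offending kind character.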

There is a helper for "upgrading" a TypeStr to a DType:

impl DType {
    /// Construct a scalar `DType`. (one which is not a nested array or record type)
    pub fn new_scalar(ty: TypeStr) -> Self;
}

"derive" feature

This is a replacement for #[macro_use] extern crate npy_derive. Basically, we don't want #[derive(Serialize)] to clash with serde or other crates, so we instead recommend the following setup to users:

[dependencies]
npy-rs = { version = "0.5", features = ["derive"] }

extern crate npy;

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct { a: i32, b: f64 }

Notice the above works even for 2015 edition crates. (the examples and doctests should attest to this!)

NpyData::dtype

impl<'a, T: Deserialize> NpyData<'a, T> {
    pub fn dtype(&self) -> DType;
}

This is necessary in order to be able to read an NPY file and then write a new one back that has the same format. I also wanted it for some of the tests.

Little Things

  • travis now checks that the examples/ build (because I had to update .travis.yml to add the feature flag, and at that point... why not?)
  • DType now derives Clone because I needed it in the derive macro.
  • npy_derive now depends on proc_macro2, whose types appear in the public API of syn and quote
  • unnecessary lifetime deleted from to_file

...and I think that's everything. (whew!)


Edit 1: Updated the signatures of TypeRead to reflect the new, faster API
Edit 2:

  • Removed TypeRead::read_one_into (it can be added backwards compatibly later)
  • Removed "Helpers for creating |Sn and |Vn" (it can be added backwards compatibly later)
  • Added NpyData::dtype

@ExpHP
Author

ExpHP commented Jun 4, 2019

Results of the existing bench.rs:

Before

running 2 tests
test read  ... bench:     112,028 ns/iter (+/- 6,674)
test write ... bench:     895,624 ns/iter (+/- 32,575)

After

running 2 tests
test read  ... bench:     463,360 ns/iter (+/- 153,967)   (~4x slowdown)
test write ... bench:   1,596,336 ns/iter (+/- 97,324)    (~2x slowdown)

Yeowch.

This is probably due to the double indirection in the current integer/float impls, which likely prevents inlining. There's no longer any good reason for them to have this indirection since I've decided not to support promotions like u32 -> u64, so we'll see how this improves with just a simple branch on endianness.

Edit: Further benchmarks

Edit: After fixing indirection

running 2 tests
test read  ... bench:     198,521 ns/iter (+/- 12,449)   (x1.75 slowdown)
test write ... bench:   1,092,576 ns/iter (+/- 61,045)   (x1.25 slowdown)

Edit: After adding dedicated Little Endian newtypes

running 2 tests
test le_read  ... bench:     168,137 ns/iter (+/- 9,882)
test le_write ... bench:     948,938 ns/iter (+/- 39,311)

The issue seems to be the fact that it no longer statically knows the strides. I have an idea for how to fix this: read_one will need to become something like

pub trait TypeRead {
    type Value;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}
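A rough sketch of the difference between the two reading strategies, assuming little-endian `f32` data. The function names are hypothetical, and whether the second variant is actually faster depends on the optimizer; the sketch only shows the shape of the idea, not a measured result.

```rust
use std::convert::TryInto;

/// Reads one little-endian f32 and hands back the unread remainder.
/// The width `4` is a compile-time constant inside this function.
pub fn read_f32_le(bytes: &[u8]) -> (f32, &[u8]) {
    let (head, rest) = bytes.split_at(4);
    (f32::from_le_bytes(head.try_into().expect("split_at(4) yields 4 bytes")), rest)
}

/// Variant A: each element is located via a stride that is only known at runtime.
pub fn read_all_strided(bytes: &[u8], stride: usize, n: usize) -> Vec<f32> {
    (0..n).map(|i| read_f32_le(&bytes[i * stride..]).0).collect()
}

/// Variant B: each read continues from the remainder of the previous one,
/// so no stride needs to be looked up or multiplied.
pub fn read_all_chained(mut bytes: &[u8], n: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let (value, rest) = read_f32_le(bytes);
        out.push(value);
        bytes = rest;
    }
    out
}
```

Both variants return the same values for tightly packed data; they differ only in how the position of the next element is derived.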

@ExpHP
Author

ExpHP commented Jun 5, 2019

The fn read_one(&self, bytes: &[u8]) -> (Self::Value, &[u8]); fix was wildly successful at optimizing the reads for scalar data and derived structs. Now my local branch is actually faster than master at reading most types! (that is, if you trust the current reading benchmarks... which I really don't)

For this reason I've decided against the inclusion of the fixed-endian wrapper types in this PR.

Before:

test array::read      ... bench:   1,325,069 ns/iter (+/- 43,806)
test array::write     ... bench:   4,602,994 ns/iter (+/- 614,125)
test one_field::read  ... bench:      95,717 ns/iter (+/- 4,101)
test one_field::write ... bench:     476,748 ns/iter (+/- 27,877)
test plain_f32::read  ... bench:      86,771 ns/iter (+/- 3,572)
test plain_f32::write ... bench:     561,184 ns/iter (+/- 9,845)
test simple::read     ... bench:     114,684 ns/iter (+/- 6,689)
test simple::write    ... bench:     869,041 ns/iter (+/- 50,057)

After:

test array::read         ... bench:   1,435,941 ns/iter (+/- 48,061)  (10% slowdown)
test array::write        ... bench:   4,567,738 ns/iter (+/- 160,708)
test one_field::read     ... bench:      56,118 ns/iter (+/- 1,440)   (50% speedup)
test one_field::write    ... bench:     477,967 ns/iter (+/- 28,492)
test plain_f32::read     ... bench:      55,995 ns/iter (+/- 3,302)   (50% speedup)
test plain_f32::write    ... bench:     477,693 ns/iter (+/- 12,268)  (15% speedup)
test simple::read        ... bench:      84,084 ns/iter (+/- 4,506)   (30% speedup)
test simple::write       ... bench:   1,091,575 ns/iter (+/- 18,112)  (25% slowdown)

@ExpHP
Author

ExpHP commented Jun 6, 2019

I removed a couple of things that can be backwards-compatibly added back later. What remains are basically the things I need for unit tests.

ExpHP added 9 commits June 6, 2019 11:37
This enables these derives to be qualified under `npy`, for
disambiguation from `serde`:

    extern crate npy;

    #[derive(npy::Serialize, npy::Deserialize)]
    struct MyStruct { ... }

This has a couple of downsides with regard to maintenance:

* npy_derive can no longer be a dev-dependency because it must
  become an optional dependency.
* Many tests and examples need the feature.  We need to list all
  of these in Cargo.toml.
* Because this crate is 2015 edition, as soon as we list *any*
  tests and examples, we must list *all* of them; including the
  ones that don't need the feature!

---

This commit had to update `.travis.yml` to start using the feature.
I took this opportunity to also add `--examples` (which the default
script does not do) to ensure that examples build correctly.
The new derive macros will need this...
ExpHP added 5 commits June 6, 2019 14:04
This had to wait until after the derives were added
so that the tests could use the derives.
This is the single most important commit in the PR.
All breaking changes to existing public APIs are contained in here.
Serialize is completely removed.

Examples and tests are not yet updated, so they are broken in this
commit.
Fix a couple of things I missed while rewriting and
reorganizing the commit history.
I tried a variety of things to optimize this function:

* Replacing usage of get_unchecked with reuse of the remainder returned
  by read_one, so that the stride can be statically known rather than
  having to be looked up. (this is what optimized the old read benchmark)
* Putting an assertion up front to prove that the data vector is long enough.

But whatever I do, performance won't budge.  In the f32 benchmark, a very
hot bounds check still occurs on every read to ensure that the length of
the data is at least 4 bytes.

So I'm adding the benchmark, but leaving the function itself alone.
@ExpHP
Author

ExpHP commented Jun 6, 2019

This is finished now!

I completely rewrote the git history from scratch, organizing all of the changes into easily reviewable groups. You should be able to read each individual commit one by one and make sense of them.

2015 edition crates do not use NLL yet in the latest stable
compiler, so our derive macro must be conservative.

Apparently this has changed in the latest nightly; 2015 edition
will use NLL at some point in the future.  This is why I did not
notice the problem at first!
Owner

@potocpav potocpav left a comment

First of all, thank you for your heroic effort to improve and generalize this library! And I'm sorry I didn't have time to review your changes earlier.

I didn't review all the commits & changes just yet, but I'm quite sure I agree with all your architectural decisions. Your solution of using two-stage serialization to efficiently support all possible formats is particularly neat.

I commented on a few things I found while browsing the commits, but those are just nit-picks. I will try to finish the review so that we can merge everything ASAP.

Edit: And thanks for the detailed write-up, it really helped understand the changes. :-)

ExpHP and others added 3 commits June 12, 2019 11:45
It was an artefact of an old design.

Out of paranoia, I added some assertions to the Serialize/Deserialize
impls to make sure the endianness is valid.  These are redundant since
it is checked in TypeStr::from_str, but most other such properties are
at least implicitly checked by the `_` arm in impls of `reader` and
`writer` and I wanted to be safe.
I'm not really sure why I had it return a clone in the first place...
@ExpHP
Author

ExpHP commented Jun 12, 2019

Two notes:

  • I changed NpyData::dtype to return &DType instead of DType.
  • After this is merged, I would like to submit another PR to add support for n-dimensional arrays before the next version release.
    • Basically, I'm currently prototyping support for shape.len() != 1 on my own fork, and the most reasonable API I was able to come up with makes open_with_dtype redundant (so I'd like to remove it before it becomes part of the published API).

This was referenced Jun 12, 2019
@ExpHP
Author

ExpHP commented Jun 21, 2019

@potocpav have you had any time to finish the review?

@ExpHP
Author

ExpHP commented Jul 12, 2019

There are a couple of WTFs in this that I became aware of after cleaning them up on my fork. I can include those changes here if you want:

  • Using ByteOrder only for NativeEndian seemed kind of weird and required dumb hacks for u8, so I replaced it with a trait (not publicly exposed) for reading primitives. This additionally made the macros not need to know method names or generate modules anymore.

  • impl_integer_serializable! was needlessly complicated, and the recursive loop structure could be replaced with a $()* repetition. The loop was a leftover from a previous design (one where widening conversions were supported, rather than requiring the size to be an exact match).
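The two cleanups above might look roughly like the following: a small private-style trait for reading primitives, implemented for every type at once with a `$()*` repetition instead of a recursive macro. `PrimitiveRead` and `impl_primitive_read!` are hypothetical names chosen for this sketch, not the identifiers used on the fork.

```rust
use std::convert::TryInto;
use std::mem::size_of;

/// Hypothetical helper trait: read one primitive from the front of a buffer.
pub trait PrimitiveRead: Sized {
    fn read_le(bytes: &[u8]) -> Self;
    fn read_be(bytes: &[u8]) -> Self;
}

// One flat repetition covers every listed type; no recursion is needed
// because each impl is independent of the others.
macro_rules! impl_primitive_read {
    ($($ty:ty),*) => {
        $(
            impl PrimitiveRead for $ty {
                fn read_le(bytes: &[u8]) -> Self {
                    // Slicing panics if the buffer is too short.
                    <$ty>::from_le_bytes(
                        bytes[..size_of::<$ty>()].try_into().expect("length fixed by slicing"),
                    )
                }

                fn read_be(bytes: &[u8]) -> Self {
                    <$ty>::from_be_bytes(
                        bytes[..size_of::<$ty>()].try_into().expect("length fixed by slicing"),
                    )
                }
            }
        )*
    };
}

impl_primitive_read!(u8, i8, u16, i16, u32, i32, u64, i64, f32, f64);
```

Because `u8` goes through the same `from_le_bytes`/`from_be_bytes` path as every other type, no special case is needed for single-byte values in this sketch.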

@ExpHP
Author

ExpHP commented Jul 13, 2019

I'm also now having second thoughts about reading DateTimes/TimeDeltas as u64/i64 in this PR.

Basically, I think that at some point I'll want to add back the widening conversions for integers (because integer arrays produced in python code will often have a type that is dynamically chosen between <i4 and <i8), in which case DateTime as u64 doesn't feel right; there should be a dedicated DateTime wrapper type.
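To illustrate what a widening conversion would mean here, the following sketch shows a reader for `i64` that accepts either `<i4` or `<i8` data. `I64Reader` and `i64_reader` are hypothetical, the dtype is matched as a plain string for brevity, and none of this is part of the PR as submitted.

```rust
use std::convert::TryInto;

/// Hypothetical reader that remembers which on-disk width was validated.
pub enum I64Reader { FromI4, FromI8 }

/// Validation step: accept both integer widths, reject everything else.
pub fn i64_reader(type_str: &str) -> Result<I64Reader, String> {
    match type_str {
        "<i4" => Ok(I64Reader::FromI4),
        "<i8" => Ok(I64Reader::FromI8),
        other => Err(format!("cannot read {} as i64", other)),
    }
}

impl I64Reader {
    /// Reads one value, sign-extending 4-byte input to 64 bits.
    pub fn read_one<'a>(&self, bytes: &'a [u8]) -> (i64, &'a [u8]) {
        match self {
            I64Reader::FromI4 => {
                let (head, rest) = bytes.split_at(4);
                let narrow = i32::from_le_bytes(head.try_into().expect("4 bytes"));
                (i64::from(narrow), rest)
            }
            I64Reader::FromI8 => {
                let (head, rest) = bytes.split_at(8);
                (i64::from_le_bytes(head.try_into().expect("8 bytes")), rest)
            }
        }
    }
}
```

Under this design a plain integer type is a "number of some width", which is why also accepting `<M8[us]` for it would feel out of place and a dedicated wrapper type seems the more natural home for datetimes.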


I really wish I could find a way to make this PR smaller...

(the only bit that I think can really be pulled out is the derive feature, but that accounts for relatively few changes)

@ExpHP
Author

ExpHP commented Jul 15, 2019

Rats, I just realized DateTime should be an i64, not a u64. I'll just remove support for these for now.

I don't want to yet commit to a specific API for serializing
DateTime/TimeDelta, so that we can keep open the option of
widening conversions.
@xd009642

Is there any progress on this? I ask because I've come across this issue myself while trying to deserialise u8 numpy arrays.

@ExpHP
Author

ExpHP commented Jul 2, 2021

So it is now two years later; Pavel never responded again to this PR or the other, and I found myself needing npy files in rust once again, so I finally came back to work on my fork and have now published it under the name npyz. In addition to the things from this PR, it also has support for:

  • io::Read/io::Write
  • n-dimensional arrays (:o !!!)
  • num-complex

It can be found here: https://github.com/ExpHP/npyz

Hopefully one day these additions can be merged upstream, so that they can finally be under the name npy where people are most likely to look for them; but I've more or less given up on that by this point.
