Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement default time resolution for CF time encoding/decoding #9580

Closed
wants to merge 6 commits into from

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Oct 4, 2024

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This is a first attempt to get behind #7493 and #9154.

  • added time_resolution option to OPTIONS to be set with xr.set_options(time_resolution="us") (defaulting to ns)
  • refactored nanosecond_precision_timestamp to default_precision_timestamp, this now returns lowest needed resolution or default resolution (if that is higher)
  • refactored decode_datetime_with_pandas to accomodate for non-naosecond treatment,
    as pd.to_timedelta returns nanosecond-resolution in any case the processing had to be adapted in case of non-nanosecond timedeltas
  • adapted some warnings

Besides enabling the changed warnings no tests have been changed so far, everything should work as expected with the default resolution of ns.

This already works for encoding/decoding with different default resolutions (but, note the datetime64 issue for "0001-01-01T00:00:00.000000000"):

for tu in ["s", "ms", "us", "ns"]:
    print("-----------------------------------------------------------------------------------")
    times = [
        np.datetime64("0001-01-01T00:00:00.000000000", tu), 
        np.datetime64("1677-09-21T00:12:43.145224193", tu), 
        np.datetime64("1970-01-01T00:00:00", tu), 
        np.datetime64("NaT")
    ]
    encoding = dict(dtype="int64", _FillValue=20)
    with xr.set_options(time_resolution=tu):
        var = xr.Variable(["time"], times, encoding=encoding)
        assert var.dtype == np.dtype(f"=M8[{tu}]")
        print(var)
        encoded_var = xr.conventions.encode_cf_variable(var)
        print(encoded_var)
        decoded_var = xr.conventions.decode_cf_variable("foo", encoded_var)
        assert decoded_var.dtype == np.dtype(f"=M8[{tu}]")
        print(decoded_var.load())
        print(var.encoding, decoded_var.encoding)
Output
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00', '1677-09-21T00:12:43',
       '1970-01-01T00:00:00',                 'NaT'],
      dtype='datetime64[s]')
<xarray.Variable (time: 4)> Size: 32B
array([          0, 52912224763, 62135596800,          20])
Attributes:
    units:       seconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00', '1677-09-21T00:12:43',
       '1970-01-01T00:00:00',                 'NaT'],
      dtype='datetime64[s]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'seconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000', '1677-09-21T00:12:43.145',
       '1970-01-01T00:00:00.000',                     'NaT'],
      dtype='datetime64[ms]')
<xarray.Variable (time: 4)> Size: 32B
array([             0, 52912224763145, 62135596800000,             20])
Attributes:
    units:       milliseconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000', '1677-09-21T00:12:43.145',
       '1970-01-01T00:00:00.000',                     'NaT'],
      dtype='datetime64[ms]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'milliseconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000000', '1677-09-21T00:12:43.145224',
       '1970-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
<xarray.Variable (time: 4)> Size: 32B
array([                0, 52912224763145224, 62135596800000000,
                      20])
Attributes:
    units:       microseconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000000', '1677-09-21T00:12:43.145224',
       '1970-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'microseconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['1754-08-30T22:43:41.128654848', '1677-09-21T00:12:43.145224193',
       '1970-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
<xarray.Variable (time: 4)> Size: 32B
array([                   0, -2428007457983430655,  6795364578871345152,
                         20])
Attributes:
    units:       nanoseconds since 1754-08-30 22:43:41.128654848
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['1754-08-30T22:43:41.128654848', '1677-09-21T00:12:43.145224193',
       '1970-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'nanoseconds since 1754-08-30 22:43:41.128654848', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

Enabling the proposed changes would allow us to (at least) get feedback which of their workflows already work if they use another default resolution and which not. It would also help us in investigation solutions to fully resolve #7493. I've already tried to come up with a fully automagic version, but that didn't work out (yet) more or less because the current tests are not easy to transform to different default resolution.

Anyway, it would be great to get any feedback here. Pinging some folks who might have interest, @spencerkclark, @ChrisBarker-NOAA, @rabernat, @khider, please include other interested parties, too.

@kmuehlbauer kmuehlbauer force-pushed the default-time-resolution branch from 2ecb52b to b0a325d Compare October 4, 2024 12:45
@kmuehlbauer
Copy link
Contributor Author

There are some issues with pandas.Datetime having the unit/as_unit() missing for 2.0 until 2.2. Not sure how to handle the mypy errors for that.

Another minor issue is for the older py3.10 based envs where one error isn't raised any more and in the other cftime-processing is requested (by raising OutOfBoundsError) but cftime isN#t installed.

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Oct 6, 2024

Is set_options going to be compatible with free-threaded python?

Genereally, I've avoided this kind of "global option setting" in my code, the results are too surprising.
However I understand why you might be tempted to do it here.

@kmuehlbauer
Copy link
Contributor Author

@hmaarrfk Currently "ns" is the default. This PR is meant to allow enabling other resolutions, without breaking backwards compatibility for the default "ns" resolution.

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Oct 6, 2024

but this isn't quite correct.

the "with" context managers are designed to make you feel "safe" when using them. The set, and unset variables.

Consider workers using it to "temporarily change the resolution" during their computation.

Now you spawn them. Each will enter a context manager. Will you return to the original global state when they all exit?

@kmuehlbauer
Copy link
Contributor Author

I see, thanks for highlighting this.

For that use case you would need to set the options globally (without the context manager). See https://docs.xarray.dev/en/stable/generated/xarray.set_options.html.

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Oct 6, 2024

It just gets kinda annoying to explain why libraries should not be using this new with mechanism.

In either case, I typically avoid it alltogether and specify things via kwargs.

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Oct 6, 2024

I think the ability to set defaults is "interesting", but I don't think I'll be a user of it. Thats all I am trying to say.

@kmuehlbauer
Copy link
Contributor Author

Thanks for sharing your concerns here, @hmaarrfk.

The proposed changes are more or less just for testing out other resolutions, to have a playground for use cases and workflows to finally resolve #7493.

So, finally, after resolving all issues, adapting tests etc.pp., this set_options should not be needed any more. Does that make sense?

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Oct 6, 2024

The proposed changes are more or less just for testing out other resolutions

this might in fact be helpful in a test suite.

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow thanks for digging into this Kai! This approach does seem like it could be a helpful bridge toward a more fully-formed solution. I haven't had a chance to think too deeply about the bigger picture here yet, but I left a couple initial questions.

This already works for encoding/decoding with different default resolutions (but, note the datetime64 issue for "0001-01-01T00:00:00.000000000")

Right, in the "ns" case since "0001-01-01T00:00:00.000000000" cannot be represented with a datetime64[ns] value, it has already experienced overflow before xarray sees it.

else np.datetime_data(ref_date.asm8)[0]
)
ref_date_delta = np.timedelta64(1, ref_date_unit).astype("timedelta64[ns]")
time_delta = np.timedelta64(1, time_units).astype("timedelta64[ns]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered the case where flat_num_dates has a floating point dtype? In that situation one could end up needing finer precision than the time_units specified in the encoding.

But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.

In the long term, however, I agree that it would be nice to have some way of inferring this automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective the decoding process should just do what it is supposed to do , which is decode into the designated datetimes as given by time_unit since reference_date. As our final dtype (np.datetime64) is based on unix epoch (1970-01-01, with different possible resolutions) we need to find the fitting resolution.

days since 1970-01-01 00:00:00 -> "s" (reference_date takes precedence over time_unit since it is in second resolution)
seconds since 1970-01-01 -> "s" (time_unit take precedence over reference_date)
milliseconds since 1970-01-01 00:00:00.000000001 -> "ns" (reference_date takes precedence over time_unit)

For now we have a third possible element to consider, the default resolution, which currently is "ns" (or might be changed to something else now).

The current proposed workflow with the default resolution should finally not be needed any more, when all issues have been ironed out. Or, we use something like None to specify automatic inferring.

But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.

Maybe the idea of a default resolution should only come into consideration in the encoding process, when the user does not specify any units?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or no default resolution at all and try to infer everything (if that's possible at all)?

A good example showing the ambiguities between on-disk and in-memory is something like

  • on-disk: [0, 1, 2, 3] with days since 1970-01-01 12:00:00
    will translate to:
  • in-memory: [12, 36, 60, 84] with datetime64[h]
  • in_memory default "s": [ 43200, 129600, 216000, 302400] with datetime64[s]

A default resolution of seconds s would be not problem for the above, but for this example:

  • on-disk: [0, 1, 2, 3] with seconds since 1970-01-01 00:00:00.001
    will translate to:
  • in-memory: [1, 1001, 2001, 3001] with datetime64[ms]
  • in_memory default "s": [0, 1, 2, 3] with datetime64[s]

we would run into issues. The more I think about this, the more think inferring as much as possible is the way to go, to not round/cutoff or otherwise get time mismatches between on-disk and in-memory.

Copy link
Member

@spencerkclark spencerkclark Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I agree long-term some kind of inference would be best—ideally we do not lose precision based on what was written on disk. It is just somewhat complicated (in the case of floating point times, one would need to take into account the values themselves in addition to the metadata), but we can try to hash out now what that could look like (or punt and just say floats are always decoded to a user-configurable precision, defaulting to nanosecond).

For context, I interpreted this PR as temporarily aiming to maintain a consistent NumPy datetime precision throughout xarray, while at least allowing for some flexibility (hence the modifications to the casting logic in variable.py to also respect the time_resolution set in set_options), so it was surprising to see one place where datetime precisions other than that specified in set_options could appear.

In your eyes, at what point do we remove the casting logic and fully allow mixed precision datetime values? Is this merely just an issue with the tests at this stage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spencerkclark Sorry, went down that rabbit hole the last days. I think I'm on a good way to have things sorted out. There are different locations in the code and tests were the gears mesh together.

The overall decoding and encoding logic seem to work in my local feature branch. I'll update this PR when I've fixed all the bits and pieces. Need to dig further now, back into the hole 🐰

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries—if at any point you feel it would be helpful to discuss anything in a more interactive way we could try to find a time to meet. Happy to keep things asynchronous / open though too. Thanks again for your work on this!

Comment on lines 312 to 314
fnd_min, fnd_max = flat_num_dates.min(), flat_num_dates.max()
min_delta = fnd_min * np.timedelta64(1, time_units)
max_delta = fnd_max * np.timedelta64(1, time_units)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice thing about pd.to_timedelta is that it will not truncate floating point inputs. Is the issue that prompted this change that there is no way currently to specify the precision of the timedelta that pd.to_timedelta returns (pandas-dev/pandas#49060), i.e. it will still always return a nanosecond-precision timedelta?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, pd.to_timedelta will straight convert to ns-resolution (and this raises subsequently).

@ChrisBarker-NOAA
Copy link

Thanks for this -- I haven't thought through all the details, but in my case, I'd tend to set the defaut to ms upfront and be done with it -- I never need nanosecond precision, and I often need datetime compatible encoding :-)

A few thoughts: r.e. floating point time encoding -- I really wish so many people wouldn't do this. But as a rule, if folks are using float, they are getting less precision than they need, and if they are using double they are getting more, and I seriously doubt ANYONE has carefully thought out what precision to use, so I wouldn't worry about preserving the precision. if anyone actually needs better than millisecond precision, and is encoding with a floating point, type, they are doing something wrong already.

r.e. threading -- using a context manger for this in multithreading is HARD. I would tend to recommend that folks set it globally. And in tests, if the tests are multi-threaded, that could get really ugly :-(

However, it looks like the Python Decimal module may have solved this issue:

https://docs.python.org/3/library/decimal.html#decimal.localcontext

(If I understand tha=ose docs correct, which I may not) so maybe we could make a thread safe context manager?

@spencerkclark
Copy link
Member

spencerkclark commented Oct 7, 2024

r.e. floating point time encoding -- I really wish so many people wouldn't do this.

I think we are in agreement—I'm mainly pointing out that it does happen in the wild, and in some circumstances it can be relatively sane (e.g. encoding in multiples of quarters of days, or tenths of seconds, etc.).

For this PR, what I'm suggesting we consider is that we keep things simple and punt on any precision inference—e.g. if one uses xarray.set_options(time_resolution="s") then xarray should always produce second-precision datetime values when decoding, even if that means precision loss (right now that is not necessarily the case).

@ChrisBarker-NOAA
Copy link

Thanks! I think this is going in a great direction.

@kmuehlbauer
Copy link
Contributor Author

This PR will be superseded by #9618. While #9618 is based on this PR it goes a bit further and implements kind of automatic resolution inference where the default time resolution is the resolution to which dates should at least be decoded to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Interoperability with Pandas 2.0 non-nanosecond datetime
4 participants