Interoperability with Pandas 2.0 non-nanosecond datetime #7493

Closed
khider opened this issue Jan 30, 2023 · 22 comments · Fixed by #9618

Comments

@khider

khider commented Jan 30, 2023

Is your feature request related to a problem?

As mentioned in this post on the Pangeo discourse, Pandas 2.0 will fully support non-nanosecond datetimes as indices. The motivation for this work was the paleogeosciences, a community that needs to represent time in millions of years. One of the biggest motivators is also to facilitate paleodata-model comparison. Enter xarray!

Below is a snippet of code to create a Pandas Series with a non-nanosecond datetime and export to xarray (this works). However, most of the interesting functionalities of xarray don't seem to support this datetime out-of-box:

import numpy as np
import pandas as pd
import xarray as xr

# Series indexed by second-precision (non-nanosecond) datetimes
index = np.array(['-2000-01-01', '-2005-01-01', '-2008-01-01', '-2009-01-01']).astype('M8[s]')
pds = pd.Series([10, 12, 11, 9], index=index)
xra = pds.to_xarray()  # this works
xra.plot()  # matplotlib error
xra.sel(index='-2009-01-01', method='nearest')

To test, you will need the Pandas nightly build:

pip uninstall pandas -y
pip install --pre --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple "pandas>1.9"

Describe the solution you'd like

Work towards an integration of the new datetimes with xarray, which will support users beyond the paleoclimate community

Describe alternatives you've considered

No response

Additional context

No response

@TomNicholas
Member

Hi @khider , thanks for raising this.

For those of us who haven't tried to use non-nanosecond datetimes before (e.g. me), could you possibly expand a bit more on

However, most of the interesting functionalities of xarray don't seem to support this datetime out-of-box:

specifically, where are errors being thrown from within xarray? And what functions are you referring to as examples?

@keewis
Collaborator

keewis commented Jan 30, 2023

we are casting everything back to datetime64[ns] when creating xarray objects, for example, so the only way to even get a non-nanosecond datetime variable is (or was, we might have fixed that?) through the zarr backend (though that would / might fail elsewhere).

@spencerkclark knows much more about this, but in any case we're aware of the change and are working on it (see e.g. #7441). (To be fair, though, at the moment it is mostly Spencer who's working on it, and he seems to be pretty preoccupied.)

@spencerkclark
Member

Thanks for posting this general issue @khider. This is something that has been on my radar for several months and I'm on board with it being great to support (eventually this will likely help cftime support as well).

I might hesitate to say that I'm actively working on it yet 😬. Right now, in the time I have available, I'm mostly trying to make sure that xarray's existing functionality does not break under pandas 2.0. Once things are a little more stable in pandas with regard to this new feature my plan is to take a deeper dive into what it will take to adopt in xarray (some aspects might need to be handled delicately). We can plan on using this issue for more discussion.

As @keewis notes, xarray currently will cast any non-nanosecond precision datetime64 or timedelta64 values that are introduced to nanosecond-precision versions. This casting machinery goes through pandas, however, and I haven't looked carefully into how this is behaving/is expected to behave under pandas 2.0. @khider based on your nice example it seems that it is possible for non-nanosecond-precision values to slip through, which is something we may need to think about addressing for the time being.
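
To see why the cast to nanosecond precision matters for long time spans, recall that a datetime64 value is a signed 64-bit count of ticks since 1970-01-01. The rough sketch below, in plain Python, estimates the span each unit can represent. The helper name `approx_year_range` is hypothetical, and the year length is approximated as 365.2425 days.

```python
# Approximate span of a signed 64-bit tick counter for several time units.
# `approx_year_range` is a hypothetical helper, not part of NumPy, pandas, or xarray.

SECONDS_PER_YEAR = 365.2425 * 86400  # mean Gregorian year (approximation)

TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def approx_year_range(unit):
    """Return the approximate (first, last) calendar year representable in `unit`."""
    max_ticks = 2**63 - 1
    span_years = max_ticks / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR
    return 1970 - span_years, 1970 + span_years


for unit in ("ns", "us", "ms", "s"):
    first, last = approx_year_range(unit)
    print(f"{unit:>2}: about {first:.0f} to {last:.0f}")
```

Nanosecond precision covers only a few centuries around 1970, while second precision reaches hundreds of billions of years in either direction, which is what makes it attractive for paleo time axes.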

@khider
Author

khider commented Jan 31, 2023

Hi all,

Thank you for looking into this. I was very excited when the array was created from my non-nanosecond datetime index, but I couldn't do much manipulation beyond creation.

@spencerkclark
Member

Indeed it would be nice if this "just worked" but it may take some time to sort out (sorry that this example initially got your hopes up!). Here what I mean by "address" is continuing to prevent non-nanosecond-precision datetime values from entering xarray through casting to nanosecond precision and raising an informative error if that is not possible. This of course would be temporary until we work through the kinks of enabling such support. In the big picture it is exciting that pandas is doing this in part due to your grant.

@dcherian
Contributor

@khider It would be helpful if either you or someone on your team tried to make it work and opened a PR. That would give us a sense of what's needed and might speed it along. It would be an advanced change, but we'd be happy to provide feedback.

Adding expected-fail tests would be particularly helpful!

@spencerkclark
Member

@dcherian +1. I'm happy to engage with others if they are motivated to start on this earlier.

@khider
Author

khider commented Feb 1, 2023

I might need some help with the xarray codebase. I use it quite often but never had to dig into its guts.

@TomNicholas
Member

@khider we are more than happy to help with digging into the codebase! A reasonable place to start would be just trying the operation you want to perform, and looking through the code for the functions any errors get thrown from.

You are also welcome to join our bi-weekly community meetings (there is one tomorrow morning!) or the office hours we run.

@spencerkclark
Member

I can block out time to join today's meeting or an upcoming one if it would be helpful.

@khider
Author

khider commented Feb 1, 2023

I can attend it too. 8:30am PST, correct?

@spencerkclark
Member

Great -- I'll plan on joining. That's correct. It is at 8:30 AM PT (#4001).

@spencerkclark
Member

spencerkclark commented Feb 1, 2023

Thanks for joining the meeting today @khider. Some potentially relevant places in the code that come to my mind are:

Though as @shoyer says, searching for datetime64[ns] or timedelta64[ns] will probably go a long way toward finding most of these issues.

Some design questions that come to my mind are (but you don't need an answer to these immediately to start working):

  • How do we decide which precision to decode times to? Would it be the finest precision that enables decoding without overflow?

  • This is admittedly in the weeds, but how do we decide when to use cftime and when not to? It seems obvious that in the long term we should use NumPy values for proleptic Gregorian dates of all precisions, but what about dates from the Gregorian calendar (where we may no longer have the luxury that the proleptic Gregorian and Gregorian calendars are equivalent for all representable times)?

  • Not a blocker (since this is an existing issue) but are there ways we could make working with mixed precision datetime values friendlier with regard to overflow (see numpy/numpy#16352, "ENH: overflow-safe astype for datetime64/timedelta64 unit conversion")? I worry about examples like this:

    >>> np.seterr(over="raise")
    >>> np.datetime64("1970-01-01", "ns") - np.datetime64("0001-01-01", "D")
    numpy.timedelta64(6795364578871345152,'ns')
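
The wrapped result above can be reproduced with plain integer arithmetic, and the same bound check suggests one possible answer to the first question in the list. This is only an illustrative sketch: `wrap_int64` and `finest_safe_unit` are hypothetical helpers, not NumPy or xarray functions, and the rule "finest unit that does not overflow" is an assumption rather than a settled design.

```python
from datetime import date

NS_PER_DAY = 86_400 * 10**9
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)  # NumPy reserves this bit pattern for NaT


def wrap_int64(value):
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


# Days from 0001-01-01 to 1970-01-01 in the proleptic Gregorian calendar.
days = (date(1970, 1, 1) - date(1, 1, 1)).days
true_ns = days * NS_PER_DAY       # exact, since Python ints are unbounded
wrapped_ns = wrap_int64(true_ns)  # what a 64-bit counter ends up holding

print(true_ns)     # 62135596800000000000
print(wrapped_ns)  # 6795364578871345152, the value NumPy returned above

# Ticks per day for each candidate unit, ordered from finest to coarsest.
TICKS_PER_DAY = {
    "ns": NS_PER_DAY,
    "us": 86_400 * 10**6,
    "ms": 86_400 * 10**3,
    "s": 86_400,
}


def finest_safe_unit(min_days, max_days):
    """Pick the finest unit whose int64 range covers both day offsets from 1970-01-01."""
    for unit, ticks in TICKS_PER_DAY.items():
        # Strict lower bound because INT64_MIN itself means NaT.
        if INT64_MIN < min_days * ticks and max_days * ticks <= INT64_MAX:
            return unit
    raise OverflowError("range does not fit in any supported unit")


print(finest_safe_unit(-days, 0))  # us
```

Under this rule, a time axis reaching back to year 1 would decode to microseconds, while one confined to recent centuries would stay at nanoseconds.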
    

@khider
Author

khider commented Feb 1, 2023

Thank you!

The second point that you raise is what we are concerned about right now as well. So maybe it would be good to try to resolve it. How do you deal with PMIP simulations in terms of calendar?

@spencerkclark
Member

spencerkclark commented Feb 1, 2023

Currently in xarray we make the choice based on the calendar attribute associated with the data on disk (following the CF conventions). If the data has a non-standard calendar (or cannot be represented with nanosecond-precision datetime values) then we use cftime; otherwise we use NumPy. Which kind of calendar do PMIP simulations typically use?
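
As a rough illustration of that rule, the choice can be sketched as a small function. Everything here is hypothetical: `choose_time_backend` is not an xarray function, the set of calendar names treated as standard is an assumption based on the CF conventions, and the year bounds are whole-year approximations of the nanosecond range.

```python
# Hypothetical sketch of the decision described above; xarray's real
# time-decoding logic handles more cases than this.

STANDARD_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}

# Whole years that fall safely inside the datetime64[ns] range (approximation).
NS_MIN_YEAR, NS_MAX_YEAR = 1678, 2261


def choose_time_backend(calendar, first_year, last_year):
    """Return "numpy" or "cftime" for data spanning first_year..last_year."""
    if calendar.lower() not in STANDARD_CALENDARS:
        return "cftime"  # non-standard calendar, e.g. "noleap" or "360_day"
    if first_year < NS_MIN_YEAR or last_year > NS_MAX_YEAR:
        return "cftime"  # not representable at nanosecond precision
    return "numpy"


print(choose_time_backend("noleap", 1, 10))          # cftime
print(choose_time_backend("standard", 1850, 2100))   # numpy
print(choose_time_backend("gregorian", 1, 2000))     # cftime
```

With non-nanosecond support, the second branch is the one that would change, since a coarser NumPy unit could cover dates that currently fall back to cftime.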

For some background -- my initial need in this realm came mainly from idealized climate model simulations (e.g. configured to start on 0001-01-01 with a no-leap calendar), so I do not have a ton of experience with paleoclimate research. I would be happy to learn more about your application, however!

@mjwillson

Hi all, I just ran into a really nasty-to-track-down bug in xarray (version 2023.08.0, apologies if this is fixed since) where non-nanosecond datetimes are creeping in via expand_dims. Look at the difference between expand_dims and assign_coords:

In [33]: xarray.Dataset().expand_dims({'foo': [np.datetime64('2018-01-01')]})
Out[33]: 
<xarray.Dataset>
Dimensions:  (foo: 1)
Coordinates:
  * foo      (foo) datetime64[s] 2018-01-01
Data variables:
    *empty*

In [34]: xarray.Dataset().assign_coords({'foo': [np.datetime64('2018-01-01')]})
third_party/py/xarray/core/utils.py:1211: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
third_party/py/xarray/core/utils.py:1211: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
Out[34]: 
<xarray.Dataset>
Dimensions:  (foo: 1)
Coordinates:
  * foo      (foo) datetime64[ns] 2018-01-01
Data variables:
    *empty*

It seems for the time being xarray depends on datetime64[ns] being used everywhere for correct behaviour -- I've seen some very weird data corruption silently happen with datetimes when the wrong datetime64 types are used accidentally due to this bug. So good to be consistent about always enforcing datetime64[ns], for as long as this is the case.
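
One defensive workaround, in line with the warning text above, is to convert values to nanosecond precision before they reach xarray. The snippet below is a minimal sketch using NumPy only; it does not call xarray, and whether it avoids every code path that lets other precisions through is an assumption.

```python
import numpy as np

# A bare np.datetime64 built from a date string gets day precision.
value = np.datetime64("2018-01-01")
print(value.dtype)  # datetime64[D]

# Cast explicitly to nanosecond precision before handing the values on.
values_ns = np.array([value]).astype("datetime64[ns]")
print(values_ns.dtype)  # datetime64[ns]
```

The resulting `values_ns` array could then be passed to `expand_dims` or `assign_coords` in place of the raw list.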

@spencerkclark
Member

Agreed, many thanks for the report @mjwillson—we'll have to track down why this slips through in the case of expand_dims.

@spencerkclark
Member

@mjwillson I think I tracked down the cause of the expand_dims issue—see #8782 for a fix.

copybara-service bot pushed a commit to google-research/weatherbench2 that referenced this issue Apr 15, 2024
…on is not fully supported in Xarray. See pydata/xarray#7493

PiperOrigin-RevId: 625092146
copybara-service bot pushed a commit to google-research/weatherbench2 that referenced this issue Apr 22, 2024
…on is not fully supported in Xarray. See pydata/xarray#7493

PiperOrigin-RevId: 625092146
copybara-service bot pushed a commit to google-research/weatherbench2 that referenced this issue Apr 22, 2024
…on is not fully supported in Xarray. See pydata/xarray#7493

PiperOrigin-RevId: 627130182
@kmuehlbauer
Contributor

With the merge of #9618, xarray should be able to work with non-nanosecond datetime/timedelta resolutions ("s", "ms", "us"). Please use the latest main for testing and report any problems in dedicated issues with an MCVE. Thanks!

@CommonClimate

CommonClimate commented Jan 16, 2025

This is wonderful news @kmuehlbauer - thank you for implementing it! In Pyleoclim we had to freeze our pandas version to 2.1.4 to preserve non-ns dtypes. What was your workaround on the pandas side (where several non-ns issues are still open, apparently)?

@spencerkclark
Member

@CommonClimate we also encountered some rough patches, but for the most part things worked as expected. Are there reasons beyond resample failing for certain non-nanosecond times (LinkedEarth/Pyleoclim_util#517, pandas-dev/pandas#57427) that you pinned to 2.1.4?

@CommonClimate

Hi @spencerkclark, that is the reason. Our entire pandas-dependent stack works with that version but not the more recent ones, as far as I know.

8 participants