
fix mean for datetime-like using the respective time resolution unit #9977

Open · wants to merge 7 commits into main
Conversation

@kmuehlbauer (Contributor) commented Jan 23, 2025

@sfinkens

Works fine, thanks a lot!

@kmuehlbauer (Contributor Author)

> Works fine, thanks a lot!

Great, let's wait for some more feedback. @dcherian and @spencerkclark would you mind taking a look here?

# from version 2025.01.2 xarray uses np.datetime64[unit] where unit
# is one of "s", "ms", "us", "ns"
# the respective unit is used for the timedelta representation
unit, _ = np.datetime_data(offset.dtype)
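For reference, `np.datetime_data` reads the `(unit, count)` pair encoded in a datetime64 dtype. A minimal sketch (the `offset` value here is a stand-in, not the variable from the diff):

```python
import numpy as np

# np.datetime_data returns the (unit, count) pair stored in a datetime64
# dtype; `offset` is a stand-in value, not the variable from the diff
offset = np.datetime64("2000-01-01", "ms")
unit, count = np.datetime_data(offset.dtype)
print(unit, count)  # ms 1
```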
Review comment from a contributor:

can we just pass offset.dtype in the astype call?

Contributor Author (@kmuehlbauer):

Unfortunately not, because we have datetime64 as offset. I'd happily change this, if there is a better method to distribute the resolution.
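A sketch of why the dtype can't be passed straight through (the `offset` value below is a stand-in): `offset.dtype` is a datetime64 dtype, but the cast target needs to be the *matching timedelta64* dtype, so the unit is read off one and transplanted into the other:

```python
import numpy as np

# offset.dtype is datetime64[...]; what the astype call needs is the
# timedelta64 dtype with the same unit, built from that unit string
offset = np.datetime64("2000-01-01", "ms")  # stand-in for the PR's offset
unit, _ = np.datetime_data(offset.dtype)
td_dtype = np.dtype(f"timedelta64[{unit}]")
print(td_dtype)  # timedelta64[ms]
```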

Contributor Author (@kmuehlbauer):

@dcherian Your comment made me think of something else, and it seems the current solution is not clever enough. For instance:

import xarray as xr
import numpy as np

timestamps = xr.DataArray(
    [np.datetime64("1970-01-01 00:00:01"),
     np.datetime64("1970-01-01 00:00:02"),
     np.datetime64("NaT", "s")]
)
timestamps.mean()
Out[2]:
<xarray.DataArray ()> Size: 8B
array('1970-01-01T01:00:01', dtype='datetime64[s]')

This does not account for the needed higher resolution: we expect a fractional mean of 1.5 seconds here. This wasn't an issue before, as we were already at ns resolution at that point.

There is a solution over in `_numbers_to_timedelta`, which moves to the needed higher resolution. But this is a pure-numpy function and would have to be wrapped somehow to handle duck arrays. I'd give this a try, but would need a bit of guidance. Is there an example somewhere I could draw inspiration from?
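The resolution-promotion idea can be sketched in pure numpy like this (a hypothetical helper, *not* xarray's `_numbers_to_timedelta`): keep multiplying a fractional mean by 1000 and stepping to the next finer unit until it is an integer, bottoming out at "ns":

```python
import numpy as np

def promote_fractional_mean(value, unit):
    # Hypothetical helper (not xarray's _numbers_to_timedelta): promote a
    # fractional mean given in `unit` to a finer resolution until it can be
    # represented as an integer timedelta64, bottoming out at "ns".
    units = ["s", "ms", "us", "ns"]
    i = units.index(unit)
    while value != int(value) and i < len(units) - 1:
        value *= 1000
        i += 1
    return np.timedelta64(int(value), units[i])

promote_fractional_mean(1.5, "s")  # 1500 milliseconds
```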

Member (@spencerkclark):

It's a good question @kmuehlbauer. Both pandas and NumPy seem to maintain the dtype, despite producing an imprecise result:

>>> times = pd.date_range("2000", freq="s", periods=2, unit="s")
>>> times.mean()
Timestamp('2000-01-01 00:00:00')

NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:

>>> np.array([0, 1]).astype("timedelta64[s]").mean()
np.timedelta64(0,'s')

I wonder if we should just do the same for now?

I guess an argument against something fancier is that it could cause problems with overflow—consider the following example:

>>> pd.date_range("1500", freq="us", periods=2, unit="us").mean()
Timestamp('1500-01-01 00:00:00')

Technically we would need nanosecond-precision to resolve the true mean, but it would be for an out-of-bounds time at that precision.
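The overflow concern can be demonstrated directly; numpy appears to wrap silently rather than raise when a unit conversion leaves the target unit's representable range (an illustrative sketch):

```python
import numpy as np

# 1500-01-01 is outside the ~1677-2262 range representable at ns precision,
# so converting to ns overflows int64 silently; the round-trip back to us
# no longer matches the original value
t = np.datetime64("1500-01-01", "us")
roundtrip = t.astype("datetime64[ns]").astype("datetime64[us]")
print(roundtrip == t)  # False
```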

Contributor Author (@kmuehlbauer):

I think I've found a backwards-compatible solution to this. If we assume our datetime64 is of some resolution unit, then the output of _mean will also be in this resolution. As offset still has the same resolution, we can just add the int64 representation of _mean, so casting to int64 will do.
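A minimal sketch of that approach (illustrative only, not the PR's actual code): subtract an offset, average the int64 view of the resulting timedeltas, and add the truncated integer mean back in the same unit:

```python
import numpy as np

# Illustrative sketch: offset and mean stay in the input's "s" resolution,
# so the integer mean can be added back directly as timedelta64[s]
array = np.array(["2000-01-01", "2000-01-03"], dtype="datetime64[s]")
offset = array.min()
deltas = (array - offset).astype("int64")        # same "s" resolution
mean = offset + deltas.mean().astype("int64").astype("timedelta64[s]")
print(mean)  # 2000-01-02T00:00:00
```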

Contributor Author (@kmuehlbauer):

> NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:
>
> >>> np.array([0, 1]).astype("timedelta64[s]").mean()
> np.timedelta64(0,'s')

The current solution (just casting _mean output to int64) will work as usual for "ns" resolution input. For lower resolution input it will align with pandas and numpy. Would that be good enough for now?

Contributor Author (@kmuehlbauer):

So it finally turned out that casting to "timedelta64" (without a unit) works without any casting RuntimeWarnings.
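A sketch of why the unit-less cast works (illustrative, not the PR's code): a generic "timedelta64" carries no unit of its own and adopts the other operand's unit when combined with a datetime64:

```python
import numpy as np

# Casting a float to generic "timedelta64" truncates to an integer with no
# unit attached; adding it to a datetime64[s] interprets it in seconds
offset = np.datetime64("1970-01-01", "s")
delta = np.float64(90.7).astype("timedelta64")   # 90, generic unit
print(offset + delta)  # 1970-01-01T00:01:30
```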

@spencerkclark (Member) left a review:

Thanks @kmuehlbauer for jumping on this—a couple more comments. I think it was useful to revisit this code independent of the non-nanosecond update.

xarray/core/duck_array_ops.py (outdated)
@@ -715,8 +713,11 @@ def mean(array, axis=None, skipna=None, **kwargs):
    if dtypes.is_datetime_like(array.dtype):
        offset = _datetime_nanmin(array)
@spencerkclark (Member) commented Jan 25, 2025:

This was an issue before, but I think it would be safest to use the Unix epoch ("1970-01-01") for the offset, since this would guarantee the timedeltas would be representable with the same precision as the datetimes. For example, this produces the incorrect result independent of this PR:

>>> array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
>>> times = xr.DataArray(array, dims=["time"])
>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('2261-04-11T23:47:16.854775808', dtype='datetime64[ns]')

I'm happy to address that in another PR if you would like (it is possible there are further simplifications we could make to this logic).
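The epoch suggestion can be sketched as follows (illustrative only, not the PR's code): with the Unix epoch as the offset, the timedeltas stay representable at the datetimes' own precision and the example above resolves to the expected midpoint:

```python
import numpy as np

# Using the Unix epoch as offset keeps (array - offset) representable with
# the same precision as the datetimes themselves (illustrative sketch)
array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
epoch = np.datetime64("1970-01-01", "ns")
deltas = (array - epoch).astype("int64")
mean = epoch + deltas.mean().astype("int64").astype("timedelta64[ns]")
print(mean)  # ~1969-01-01, the expected midpoint
```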

@kmuehlbauer (Contributor Author)

Apart from the loss of resolution when the calculated mean contains a fractional part (which is in alignment with pandas and numpy), this works for all available resolutions. In the case of ns resolution it replicates the old behaviour.

Successfully merging this pull request may close these issues.

Averaging timestamps with non-nanosecond precision