
fix mean for datetime-like using the respective time resolution unit #9977

Open · wants to merge 7 commits into main
Conversation

@kmuehlbauer (Contributor) commented Jan 23, 2025

@sfinkens

Works fine, thanks a lot!

@kmuehlbauer (Contributor Author)

> Works fine, thanks a lot!

Great, let's wait for some more feedback. @dcherian and @spencerkclark would you mind taking a look here?

# from version 2025.01.2 xarray uses np.datetime64[unit] where unit
# is one of "s", "ms", "us", "ns"
# the respective unit is used for the timedelta representation
unit, _ = np.datetime_data(offset.dtype)
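For reference, `np.datetime_data` reads the `(unit, count)` pair encoded in a datetime64 dtype. A minimal sketch (the `offset` value here is a stand-in, not the variable from the diff):

```python
import numpy as np

# np.datetime_data returns the (unit, count) pair stored in a datetime64
# dtype; `offset` is a stand-in value, not the variable from the diff
offset = np.datetime64("2000-01-01", "ms")
unit, count = np.datetime_data(offset.dtype)
print(unit, count)  # ms 1
```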
Review comment from a contributor:

can we just pass offset.dtype in the astype call?

Contributor Author (@kmuehlbauer):

Unfortunately not, because we have datetime64 as offset. I'd happily change this, if there is a better method to distribute the resolution.
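A sketch of why the dtype can't be passed straight through (the `offset` value below is a stand-in): `offset.dtype` is a datetime64 dtype, but the cast target needs to be the *matching timedelta64* dtype, so the unit is read off one and transplanted into the other:

```python
import numpy as np

# offset.dtype is datetime64[...]; what the astype call needs is the
# timedelta64 dtype with the same unit, built from that unit string
offset = np.datetime64("2000-01-01", "ms")  # stand-in for the PR's offset
unit, _ = np.datetime_data(offset.dtype)
td_dtype = np.dtype(f"timedelta64[{unit}]")
print(td_dtype)  # timedelta64[ms]
```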

Contributor Author (@kmuehlbauer):

@dcherian Your comment made me think of something else, and it seems the current solution is not clever enough. For instance:

import xarray as xr
import numpy as np

timestamps = xr.DataArray(
    [np.datetime64("1970-01-01 00:00:01"),
     np.datetime64("1970-01-01 00:00:02"),
     np.datetime64("NaT", "s")]
)
timestamps.mean()
Out[2]:
<xarray.DataArray ()> Size: 8B
array('1970-01-01T01:00:01', dtype='datetime64[s]')

This does not account for the needed higher resolution: we expect a fractional mean of 1.5 seconds here. This wasn't an issue before, as we were already at ns resolution at that point.

There is a solution over in `_numbers_to_timedelta`, which moves to the needed higher resolution. But this is a pure-numpy function and would have to be wrapped somehow to handle duck arrays. I'd give this a try, but would need a bit of guidance. Is there an example somewhere I could draw inspiration from?
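The resolution-promotion idea can be sketched in pure numpy like this (a hypothetical helper, *not* xarray's `_numbers_to_timedelta`): keep multiplying a fractional mean by 1000 and stepping to the next finer unit until it is an integer, bottoming out at "ns":

```python
import numpy as np

def promote_fractional_mean(value, unit):
    # Hypothetical helper (not xarray's _numbers_to_timedelta): promote a
    # fractional mean given in `unit` to a finer resolution until it can be
    # represented as an integer timedelta64, bottoming out at "ns".
    units = ["s", "ms", "us", "ns"]
    i = units.index(unit)
    while value != int(value) and i < len(units) - 1:
        value *= 1000
        i += 1
    return np.timedelta64(int(value), units[i])

promote_fractional_mean(1.5, "s")  # 1500 milliseconds
```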

Member (@spencerkclark):

It's a good question @kmuehlbauer. Both pandas and NumPy seem to maintain the dtype, despite producing an imprecise result:

>>> times = pd.date_range("2000", freq="s", periods=2, unit="s")
>>> times.mean()
Timestamp('2000-01-01 00:00:00')

NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:

>>> np.array([0, 1]).astype("timedelta64[s]").mean()
np.timedelta64(0,'s')

I wonder if we should just do the same for now?

I guess an argument against something fancier is that it could cause problems with overflow—consider the following example:

>>> pd.date_range("1500", freq="us", periods=2, unit="us").mean()
Timestamp('1500-01-01 00:00:00')

Technically we would need nanosecond-precision to resolve the true mean, but it would be for an out-of-bounds time at that precision.
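The overflow concern can be demonstrated directly; numpy appears to wrap silently rather than raise when a unit conversion leaves the target unit's representable range (an illustrative sketch):

```python
import numpy as np

# 1500-01-01 is outside the ~1677-2262 range representable at ns precision,
# so converting to ns overflows int64 silently; the round-trip back to us
# no longer matches the original value
t = np.datetime64("1500-01-01", "us")
roundtrip = t.astype("datetime64[ns]").astype("datetime64[us]")
print(roundtrip == t)  # False
```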

Contributor Author (@kmuehlbauer):

I think I've found a backwards-compatible solution to this. If we assume our datetime64 is of some resolution unit, then the output of _mean will also be in this resolution. As offset still has the same resolution, we can just add the int64 representation of _mean, so casting to int64 will do.
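A minimal sketch of that approach (illustrative only, not the PR's actual code): subtract an offset, average the int64 view of the resulting timedeltas, and add the truncated integer mean back in the same unit:

```python
import numpy as np

# Illustrative sketch: offset and mean stay in the input's "s" resolution,
# so the integer mean can be added back directly as timedelta64[s]
array = np.array(["2000-01-01", "2000-01-03"], dtype="datetime64[s]")
offset = array.min()
deltas = (array - offset).astype("int64")        # same "s" resolution
mean = offset + deltas.mean().astype("int64").astype("timedelta64[s]")
print(mean)  # 2000-01-02T00:00:00
```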

Contributor Author (@kmuehlbauer):

> NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:
>
> >>> np.array([0, 1]).astype("timedelta64[s]").mean()
> np.timedelta64(0,'s')

The current solution (just casting _mean output to int64) will work as usual for "ns" resolution input. For lower resolution input it will align with pandas and numpy. Would that be good enough for now?

Contributor Author (@kmuehlbauer):

So it finally turned out that casting to "timedelta64" (without a unit) works without any casting RuntimeWarnings.
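A sketch of why the unit-less cast works (illustrative, not the PR's code): a generic "timedelta64" carries no unit of its own and adopts the other operand's unit when combined with a datetime64:

```python
import numpy as np

# Casting a float to generic "timedelta64" truncates to an integer with no
# unit attached; adding it to a datetime64[s] interprets it in seconds
offset = np.datetime64("1970-01-01", "s")
delta = np.float64(90.7).astype("timedelta64")   # 90, generic unit
print(offset + delta)  # 1970-01-01T00:01:30
```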

@spencerkclark (Member) left a review:

Thanks @kmuehlbauer for jumping on this—a couple more comments. I think it was useful to revisit this code independent of the non-nanosecond update.

xarray/core/duck_array_ops.py (outdated)
@@ -715,8 +713,11 @@ def mean(array, axis=None, skipna=None, **kwargs):
    if dtypes.is_datetime_like(array.dtype):
        offset = _datetime_nanmin(array)
@spencerkclark (Member) commented Jan 25, 2025:

This was an issue before, but I think it would be safest to use the Unix epoch ("1970-01-01") for the offset, since this would guarantee the timedeltas would be representable with the same precision as the datetimes. For example, this produces the incorrect result independent of this PR:

>>> array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
>>> times = xr.DataArray(array, dims=["time"])
>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('2261-04-11T23:47:16.854775808', dtype='datetime64[ns]')

I'm happy to address that in another PR if you would like (it is possible there are further simplifications we could make to this logic).
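The epoch suggestion can be sketched as follows (illustrative only, not the PR's code): with the Unix epoch as the offset, the timedeltas stay representable at the datetimes' own precision and the example above resolves to the expected midpoint:

```python
import numpy as np

# Using the Unix epoch as offset keeps (array - offset) representable with
# the same precision as the datetimes themselves (illustrative sketch)
array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
epoch = np.datetime64("1970-01-01", "ns")
deltas = (array - epoch).astype("int64")
mean = epoch + deltas.mean().astype("int64").astype("timedelta64[ns]")
print(mean)  # ~1969-01-01, the expected midpoint
```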

@kmuehlbauer (Contributor Author)

Apart from the loss of resolution when the calculated mean contains a fractional part (which is in alignment with pandas and numpy), this works for all available resolutions. In the case of ns resolution it replicates the old behaviour.

Successfully merging this pull request may close these issues.

Averaging timestamps with non-nanosecond precision