Fill gaps limited 7665 #9402

Open
wants to merge 46 commits into base: main
Changes from 39 commits
e65bb4d
Introduce new arguments limit_direction, limit_area, limit_use coordi…
Ockenfuss Jun 10, 2024
63dabc9
Use internal broadcasting and transpose instead of ones_like
Ockenfuss Jun 10, 2024
fdd3ca7
Typo: Default False in doc for limit_use_coordinates
Ockenfuss Jun 10, 2024
8393d72
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 10, 2024
e725008
Towards masked implementation
Ockenfuss Jun 11, 2024
d5466f5
Working fill_gaps implementation
Ockenfuss Jun 20, 2024
1b8ea9e
Remove keep_attrs from docstring of filling functions
Ockenfuss Aug 23, 2024
b956e14
Fix typos, undo empty spaces, remove temporarily introduced arguments
Ockenfuss Aug 23, 2024
d717dd9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 23, 2024
0c4fdab
Add line break for readability
Ockenfuss Aug 24, 2024
7f06b3a
Enforce kwargs to be passed by name
Ockenfuss Aug 24, 2024
6090a4d
Keep_Attrs: Default to True
Ockenfuss Aug 24, 2024
f8cc0c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 24, 2024
6de7640
Explicitly add fill functions in GapMask object
Ockenfuss Aug 25, 2024
07a0d01
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 25, 2024
274168c
Add type hints to most arguments, return types
Ockenfuss Aug 25, 2024
e455f5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 25, 2024
d58b0ac
Fix accidental double pasting of arguments
Ockenfuss Aug 25, 2024
9700119
Fix more mypy errors
Ockenfuss Aug 25, 2024
4a360fa
Bottleneck is required for limit functionality
Ockenfuss Aug 25, 2024
7389bf7
Docs: Require numbagg or bottleneck for ffill/bfill/fill_gaps
Ockenfuss Aug 26, 2024
93c72f5
Rework index conversion to have consistent typing
Ockenfuss Aug 26, 2024
72c76db
Add new method to api.rst
Ockenfuss Oct 2, 2024
6631aeb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 21, 2025
7102381
Reimport utils (deleted during rebase)
Ockenfuss Jan 22, 2025
5fe9feb
Remove typo (double line)
Ockenfuss Jan 22, 2025
a74a840
Add a hint in the interpolate_na docs about the forward (legacy) beha…
Ockenfuss Jan 22, 2025
3b7d6bc
Add example in User guide
Ockenfuss Jan 22, 2025
9a5bc29
Fix typo in documentation
Ockenfuss Jan 22, 2025
511323d
Fix typing errors (ignore types for limit argument internally)
Ockenfuss Jan 22, 2025
2463ded
Remove return type of interp_na to avoid mypy error
Ockenfuss Jan 22, 2025
ee34faf
Include documentation for GapMask Object
Ockenfuss Jan 22, 2025
ec221dd
Include references in docs between filling functions
Ockenfuss Jan 22, 2025
41f2bf8
Doc-Bug: Default for direction is both
Ockenfuss Jan 22, 2025
2c0d375
Typo in Documentation
Ockenfuss Jan 23, 2025
9ae4b26
Do not allow further limit or max_gap specification when calling inte…
Ockenfuss Jan 23, 2025
5cdc223
Make two stages of filling clear in fill_gaps documentation
Ockenfuss Jan 23, 2025
078e546
Default to forward for ffill and backward to bfill.
Ockenfuss Jan 24, 2025
e79f90f
Update api.rst and GapMask Attributes
Ockenfuss Jan 24, 2025
e560701
Split ffill/bfill pandas compatibility test into separate test method
Ockenfuss Jan 24, 2025
281ba6d
Fix doc examples and correct return type hint
Ockenfuss Jan 24, 2025
fd04d54
Require bottleneck for ffill test
Ockenfuss Jan 24, 2025
94c88e3
Fix mask type hint in two positions.
Ockenfuss Jan 24, 2025
046a592
Remove fill_value in pandas when method=ffill/bfill
Ockenfuss Jan 24, 2025
ca4547d
Remove ffill check against old pandas version
Ockenfuss Jan 24, 2025
76424b1
Add additional tests for the direction kwarg in combination with ffil…
Ockenfuss Jan 24, 2025
17 changes: 17 additions & 0 deletions doc/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,7 @@ Missing value handling
Dataset.fillna
Dataset.ffill
Dataset.bfill
Dataset.fill_gaps
Dataset.interpolate_na
Dataset.where
Dataset.isin
Expand Down Expand Up @@ -357,6 +358,7 @@ Missing value handling
DataArray.fillna
DataArray.ffill
DataArray.bfill
DataArray.fill_gaps
DataArray.interpolate_na
DataArray.where
DataArray.isin
Expand Down Expand Up @@ -1492,6 +1494,21 @@ DataArray
DataArrayResample.dims
DataArrayResample.groups

GapMask object
===============

.. currentmodule:: xarray.core.missing

.. autosummary::
:toctree: generated/

GapMask
GapMask.fillna
GapMask.ffill
GapMask.bfill
GapMask.interpolate_na
GapMask.get_mask

Accessors
=========

Expand Down
15 changes: 15 additions & 0 deletions doc/user-guide/computation.rst
Expand Up @@ -203,6 +203,21 @@ Xarray also provides the ``max_gap`` keyword argument to limit the interpolation
data gaps of length ``max_gap`` or smaller. See :py:meth:`~xarray.DataArray.interpolate_na`
for more.

All of the above methods by default fill gaps of any size in the data. If you want fine control over the size of the gaps that are filled, you can use :py:meth:`~xarray.DataArray.fill_gaps`. For example, consider a series of air temperature measurements with gaps:

.. ipython:: python

n = np.nan
temperature = xr.DataArray(
[n, 1.1, n, n, n, 2, n, n, n, n, 2.3],
coords={"time": xr.Variable("time", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])},
)
temperature.fill_gaps(
"time", limit=1, limit_direction="both", max_gap=4
).interpolate_na("time")

In this example, we interpolate valid measurements up to one hour forward and backward in time. However, if a gap is longer than four hours, nothing is interpolated. :py:meth:`~xarray.DataArray.fill_gaps` returns a :py:class:`~xarray.core.missing.GapMask` object that works with all filling methods (:py:meth:`~xarray.DataArray.ffill`, :py:meth:`~xarray.DataArray.bfill`, :py:meth:`~xarray.DataArray.fillna`, :py:meth:`~xarray.DataArray.interpolate_na`). See :py:meth:`~xarray.DataArray.fill_gaps` for more information on the available options.
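
To make the interplay of ``limit`` and ``max_gap`` in the example above concrete, the following pure-Python sketch computes which NaNs would be selected for filling. It is an illustration based only on the definitions given in the ``fill_gaps`` docstring, not the implementation in this PR. The function name ``fillable_mask`` is hypothetical, ``limit`` is always applied in both directions, and ``limit_area`` is ignored.

```python
import math


def fillable_mask(values, coords, limit=None, max_gap=None):
    """Mark the NaNs that may be filled (hypothetical helper, not xarray API)."""
    n = len(values)
    valid = [not math.isnan(v) for v in values]
    mask = [False] * n
    i = 0
    while i < n:
        if valid[i]:
            i += 1
            continue
        start = i
        while i < n and not valid[i]:
            i += 1
        stop = i - 1  # [start, stop] is one run of consecutive NaNs
        # Coordinates of the valid neighbours, or None for a gap at an edge.
        left = coords[start - 1] if start > 0 else None
        right = coords[stop + 1] if stop + 1 < n else None
        # Gap length: an edge gap is measured from its outermost NaN.
        length = (right if right is not None else coords[stop]) - (
            left if left is not None else coords[start]
        )
        if max_gap is not None and length > max_gap:
            continue  # max_gap rejects the whole gap
        for j in range(start, stop + 1):
            if limit is None:
                mask[j] = True
                continue
            # limit keeps only NaNs close enough to a valid neighbour.
            near_left = left is not None and coords[j] - left <= limit
            near_right = right is not None and right - coords[j] <= limit
            mask[j] = near_left or near_right
    return mask


n = math.nan
temperature = [n, 1.1, n, n, n, 2, n, n, n, n, 2.3]
time_coord = list(range(11))
mask = fillable_mask(temperature, time_coord, limit=1, max_gap=4)
# Selected positions: 0, 2 and 4. The gap at positions 6 to 9 has length 5
# and is rejected entirely by max_gap=4.
```

Under these assumptions, ``limit`` trims each gap to its edges, while ``max_gap`` accepts or rejects a gap as a whole.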

.. _agg:

Aggregation
Expand Down
208 changes: 189 additions & 19 deletions xarray/core/dataarray.py
Expand Up @@ -97,6 +97,7 @@
from xarray.backends import ZarrStore
from xarray.backends.api import T_NetcdfEngine, T_NetcdfTypes
from xarray.core.groupby import DataArrayGroupBy
from xarray.core.missing import GapMask
from xarray.core.resample import DataArrayResample
from xarray.core.rolling import DataArrayCoarsen, DataArrayRolling
from xarray.core.types import (
Expand All @@ -109,6 +110,8 @@
GroupIndices,
GroupInput,
InterpOptions,
LimitAreaOptions,
LimitDirectionOptions,
PadModeOptions,
PadReflectOptions,
QuantileMethods,
Expand All @@ -120,6 +123,7 @@
SideOptions,
T_ChunkDimFreq,
T_ChunksFreq,
T_GapLength,
T_Xarray,
)
from xarray.core.weighted import DataArrayWeighted
Expand Down Expand Up @@ -3475,6 +3479,11 @@ def fillna(self, value: Any) -> Self:
-------
filled : DataArray

See Also
--------
:ref:`missing_values`
DataArray.fill_gaps

Examples
--------
>>> da = xr.DataArray(
Expand Down Expand Up @@ -3520,27 +3529,19 @@ def fillna(self, value: Any) -> Self:

def interpolate_na(
self,
dim: Hashable | None = None,
dim: Hashable,
method: InterpOptions = "linear",
limit: int | None = None,
use_coordinate: bool | str = True,
max_gap: (
None
| int
| float
| str
| pd.Timedelta
| np.timedelta64
| datetime.timedelta
) = None,
use_coordinate: bool | Hashable = True,
max_gap: T_GapLength | None = None,
keep_attrs: bool | None = None,
**kwargs: Any,
) -> Self:
"""Fill in NaNs by interpolating according to different methods.

Parameters
----------
dim : Hashable or None, optional
dim : Hashable
Specifies the dimension along which to interpolate.
method : {"linear", "nearest", "zero", "slinear", "quadratic", "cubic", "polynomial", \
"barycentric", "krogh", "pchip", "spline", "akima"}, default: "linear"
Expand All @@ -3555,17 +3556,17 @@ def interpolate_na(
- 'barycentric', 'krogh', 'pchip', 'spline', 'akima': use their
respective :py:class:`scipy.interpolate` classes.

limit : int or None, default: None
Maximum number of consecutive NaNs to fill. Must be greater than 0
or None for no limit. This filling is done in the forward direction, regardless of the size of
the gap in the data. To only interpolate over gaps less than a given length,
see ``max_gap``.
use_coordinate : bool or str, default: True
Specifies which index to use as the x values in the interpolation
formulated as `y = f(x)`. If False, values are treated as if
equally-spaced along ``dim``. If True, the IndexVariable `dim` is
used. If ``use_coordinate`` is a string, it specifies the name of a
coordinate variable to use as the index.
limit : int or None, default: None
Maximum number of consecutive NaNs to fill. Must be greater than 0
or None for no limit. This filling is done regardless of the size of
the gap in the data. To only interpolate over gaps less than a given length,
see ``max_gap``.
max_gap : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
Use None for no limit. When interpolating along a datetime64 dimension
Expand Down Expand Up @@ -3603,6 +3604,7 @@ def interpolate_na(

See Also
--------
DataArray.fill_gaps
numpy.interp
scipy.interpolate

Expand All @@ -3611,6 +3613,7 @@ def interpolate_na(
>>> da = xr.DataArray(
... [np.nan, 2, 3, np.nan, 0], dims="x", coords={"x": [0, 1, 2, 3, 4]}
... )

>>> da
<xarray.DataArray (x: 5)> Size: 40B
array([nan, 2., 3., nan, 0.])
Expand Down Expand Up @@ -3645,7 +3648,7 @@ def interpolate_na(
def ffill(self, dim: Hashable, limit: int | None = None) -> Self:
"""Fill NaN values by propagating values forward

*Requires bottleneck.*
*Requires numbagg or bottleneck.*

Parameters
----------
Expand All @@ -3663,6 +3666,11 @@ def ffill(self, dim: Hashable, limit: int | None = None) -> Self:
-------
filled : DataArray

See Also
--------
:ref:`missing_values`
DataArray.fill_gaps

Examples
--------
>>> temperature = np.array(
Expand Down Expand Up @@ -3729,7 +3737,7 @@ def ffill(self, dim: Hashable, limit: int | None = None) -> Self:
def bfill(self, dim: Hashable, limit: int | None = None) -> Self:
"""Fill NaN values by propagating values backward

*Requires bottleneck.*
*Requires numbagg or bottleneck.*

Parameters
----------
Expand All @@ -3747,6 +3755,11 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:
-------
filled : DataArray

See Also
--------
:ref:`missing_values`
DataArray.fill_gaps

Examples
--------
>>> temperature = np.array(
Expand Down Expand Up @@ -3810,6 +3823,163 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:

return bfill(self, dim, limit=limit)

def fill_gaps(
Collaborator

Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?

Contributor Author

Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill

Collaborator

I think .fill could be quite nice, do others have a view?

Contributor

@dcherian dcherian Sep 3, 2024

Maybe gap_filler instead, since this method does not actually fill the gaps.

I'm also wondering if it's better to have a method that constructs the appropriate mask that can be used later

mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)

Contributor Author

@Ockenfuss Ockenfuss Sep 4, 2024

Good points! Just to explain:

gap_filler emphasizes the returned object type nicely! However, I chose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not performed immediately and an object is returned).
Ultimately (I am a non-native English speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.

The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:

mask=ds.gap_filler(...).mask
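
For comparison, the two-step pattern suggested in this thread (fill first, then blank out the masked positions) can be sketched in plain Python. This is only an illustration: the helpers `ffill` and `where` below are hypothetical list-based stand-ins for the xarray methods of the same name, and the mask is written out by hand rather than computed.

```python
import math


def ffill(values):
    """Propagate the last valid value forward (list-based stand-in)."""
    out, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out


def where(values, keep):
    """Keep values where `keep` is True, otherwise insert NaN."""
    return [v if k else math.nan for v, k in zip(values, keep)]


nan = math.nan
values = [nan, 2.0, nan, nan, 5.0, nan, 0.0]
# Hand-written mask: True marks NaNs that belong to a gap considered too long.
mask = [False, False, True, True, False, False, False]

filled = where(ffill(values), [not m for m in mask])
# -> [nan, 2.0, nan, nan, 5.0, 5.0, 0.0]

# Forgetting the negation keeps exactly the values that should be dropped.
wrong = where(ffill(values), mask)
# -> [nan, nan, 2.0, 2.0, nan, nan, nan]
```

The second call shows the failure mode mentioned above: without the negation, the result silently contains only the values from the rejected gap.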

self,
dim: Hashable,
*,
use_coordinate: bool | Hashable = True,
limit: T_GapLength | None = None,
limit_direction: LimitDirectionOptions | None = None,
limit_area: LimitAreaOptions | None = None,
max_gap: T_GapLength | None = None,
) -> GapMask[DataArray]:
"""Fill in gaps (consecutive missing values) in the data.

- Firstly, ``fill_gaps`` determines **which** values to fill, with options for fine control over how far to extend the valid data into the gaps and the maximum size of the gaps to fill.
- Secondly, calling one of several filling methods determines **how** to fill the selected values.


*Requires numbagg or bottleneck.*

Parameters
----------
dim : Hashable
Specifies the dimension along which to calculate gap sizes.
use_coordinate : bool or Hashable, default: True
Specifies which index to use when calculating gap sizes.

- False: a consecutive integer index is created along ``dim`` (0, 1, 2, ...).
- True: the IndexVariable `dim` is used.
- String: specifies the name of a coordinate variable to use as the index.

limit : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum number or distance of consecutive NaNs to fill.
Use None for no limit. When filling along a datetime64 dimension
and ``use_coordinate=True``, ``limit`` can be one of the following:

- a string that is valid input for pandas.to_timedelta
- a :py:class:`numpy.timedelta64` object
- a :py:class:`pandas.Timedelta` object
- a :py:class:`datetime.timedelta` object

Otherwise, ``limit`` must be an int or a float.
If ``use_coordinate=True``, for ``limit_direction=forward`` distance is defined
as the difference between the coordinate at a NaN value and the coordinate of the nearest valid value
to the left (right for ``limit_direction=backward``).
For example, consider::

<xarray.DataArray (x: 9)>
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8

For ``limit_direction=forward``, distances are ``[nan, nan, nan, 0, 1, 2, 0, 1, 2]``.
To only fill gaps less than a given length,
see ``max_gap``.
limit_direction : {"forward", "backward", "both"}, default: None
Consecutive NaNs will be filled in this direction.
If not specified, the default is

- "forward" if ``ffill`` is used
- "backward" if ``bfill`` is used
- "both" otherwise

Raises ValueError if not "forward" and ``ffill`` is used or not "backward" and ``bfill`` is used.
limit_area : {"inside", "outside"} or None, default: None
Consecutive NaNs will be filled with this restriction.

- None: No fill restriction.
- "inside": Only fill NaNs surrounded by valid values
- "outside": Only fill NaNs outside valid values (extrapolate).
max_gap : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
Use None for no limit. When calculated along a datetime64 dimension
and ``use_coordinate=True``, ``max_gap`` can be one of the following:

- a string that is valid input for pandas.to_timedelta
- a :py:class:`numpy.timedelta64` object
- a :py:class:`pandas.Timedelta` object
- a :py:class:`datetime.timedelta` object

Otherwise, ``max_gap`` must be an int or a float. If ``use_coordinate=False``, a linear integer
index is created. Gap length is defined as the difference
between coordinate values at the first data point after a gap and the last valid value
before a gap. For gaps at the beginning (end), gap length is defined as the difference
between coordinate values at the first (last) valid data point and the first (last) NaN.
For example, consider::

<xarray.DataArray (x: 9)>
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8

The gap lengths are 3-0 = 3; 6-3 = 3; and 8-6 = 2 respectively

Returns
-------
gap_mask : core.missing.GapMask

An object where all remaining gaps are masked. Unmasked values can be filled by calling any of the provided methods.

See Also
--------
:ref:`missing_values`
DataArray.fillna
DataArray.ffill
DataArray.bfill
DataArray.interpolate_na
pandas.DataFrame.interpolate

Notes
-----
``limit`` and ``max_gap`` have different effects on gaps: If ``limit`` is set, *some* values in a gap will be filled (up to the given distance from the boundaries). ``max_gap`` will prevent *any* filling for gaps larger than the given distance.

Examples
--------
>>> da = xr.DataArray(
... [np.nan, 2, np.nan, np.nan, 5, np.nan, 0],
... dims="x",
... coords={"x": [0, 1, 2, 3, 4, 5, 6]},
... )

>>> da
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., nan, nan, 5., nan, 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", limit=1, limit_direction="forward").interpolate_na(
... dim="x"
... )
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2. , 3. , nan, 5. , 2.5, 0. ])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", max_gap=2, limit_direction="forward").ffill(dim="x")
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., nan, nan, 5., 5., 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", limit_area="inside").fillna(9)
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., 9., 9., 5., 9., 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6
"""
from xarray.core.missing import GapMask

return GapMask(
self,
dim,
use_coordinate=use_coordinate,
limit=limit,
limit_direction=limit_direction,
limit_area=limit_area,
max_gap=max_gap,
)
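
As a reading aid for the `limit` and `max_gap` definitions in the `fill_gaps` docstring above, here is a minimal pure-Python sketch that reproduces the two worked examples (forward distances and gap lengths). The helper names `forward_distances` and `gap_lengths` are hypothetical and are not part of this PR or of xarray; the code follows the wording of the docstring only and makes no claim about how `GapMask` computes these quantities.

```python
import math


def forward_distances(values, coords):
    """Distance from each point to the nearest valid value on its left."""
    out = []
    last_valid = None
    for v, c in zip(values, coords):
        if not math.isnan(v):
            last_valid = c
        # NaN until the first valid value has been seen.
        out.append(math.nan if last_valid is None else c - last_valid)
    return out


def gap_lengths(values, coords):
    """Length of every run of NaNs, measured in coordinate units."""
    n = len(values)
    lengths = []
    i = 0
    while i < n:
        if not math.isnan(values[i]):
            i += 1
            continue
        start = i
        while i < n and math.isnan(values[i]):
            i += 1
        stop = i - 1
        # Interior gaps span from the last valid point before the gap to the
        # first valid point after it; edge gaps end at their outermost NaN.
        left = coords[start - 1] if start > 0 else coords[start]
        right = coords[stop + 1] if stop + 1 < n else coords[stop]
        lengths.append(right - left)
    return lengths


nan = math.nan
values = [nan, nan, nan, 1.0, nan, nan, 4.0, nan, nan]
x = list(range(9))

distances = forward_distances(values, x)
# -> [nan, nan, nan, 0, 1, 2, 0, 1, 2]
lengths = gap_lengths(values, x)
# -> [3, 3, 2]
```

With `limit_direction="forward"`, a NaN would be selected when its forward distance does not exceed `limit`; a gap would be rejected as a whole when its length exceeds `max_gap`.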

def combine_first(self, other: Self) -> Self:
"""Combine two DataArray objects, with union of coordinates.

Expand Down