Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill gaps limited 7665 #9402

Open
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

Ockenfuss
Copy link
Contributor

@Ockenfuss Ockenfuss commented Aug 23, 2024

Improve gap filling

This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function fill_gaps was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.

TBD if accepted:

  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • Add Example in Documentation

Deprecations

  • Using limit in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems extremely good!!

I left some refine-y comments, nothing preclusive.

What do others think? I don't use interpolate_na myself much so other may have views...

xarray/core/dataarray.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
| datetime.timedelta
) = None,
limit_direction: LimitDirectionOptions = "both",
limit_area: LimitAreaOptions | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine-y but:

  • are interpolate and extrapolate clearer options?
  • is there a better name for the param? I'm not sure. "area" sounds 2D to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I adopted this nomenclature from pandas. What about limit_region, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...

xarray/core/dataset.py Outdated Show resolved Hide resolved
@@ -3766,6 +3769,160 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:

return bfill(self, dim, limit=limit)

def fill_gaps(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think .fill could be quite nice, do others have a view?

Copy link
Contributor

@dcherian dcherian Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe gap_filler instead, since this method does not actually fill the gaps.

I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later

mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)

Copy link
Contributor Author

@Ockenfuss Ockenfuss Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points! Just to explain:

gap_filler emphasizes the returned object type nicely! However, I choose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.

The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:

mask=ds.gap_filler(...).mask

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...)

xarray/core/dataarray.py Outdated Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

@keewis keewis linked an issue Aug 26, 2024 that may be closed by this pull request
@headtr1ck
Copy link
Collaborator

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

I'm currently on holiday and it's difficult to debug these errors on the phone.
I can do it in 1-2 weeks or someone else can help out?

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

max-sixty commented Sep 10, 2024

There are a couple of outstanding questions about the interfaces; in particular:

Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those)

What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...)

@Ockenfuss
Copy link
Contributor Author

Ockenfuss commented Sep 11, 2024

One more naming decision:

  • Fill gaps limited 7665 #9402 (comment)
    I am on conference this week, but mostly available the next two weeks, if anyone wants to discuss. Otherwise, I will join the next dev meeting on Sep. 25.

@max-sixty
Copy link
Collaborator

Did you manage to discuss this this morning?

@kmuehlbauer
Copy link
Contributor

Did you manage to discuss this this morning?

@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here.

@Ockenfuss
Copy link
Contributor Author

Hi @max-sixty, not too much community feedback yet :) What do you think, shall we have another look on the three open naming issues and then push this over the finish line?
I still think this is quite useful functionality for all experimentalists, where gaps of all sizes are common in measurement data.

@max-sixty
Copy link
Collaborator

What do you think, shall we have another look on the three open naming issues and then push this over the finish line?

Yes let's do it! I would bias for action; if folks disagree then dissent welcome.

What's remaining? Personally I think we should simplify this: #9402 (comment) + resolve any names. The external types should be correct; internally some ignores are OK...

@Ockenfuss
Copy link
Contributor Author

Ok, I think from my side this is ready for merge. I stayed with the fill_gaps and limit_area names for now. Check out the doc build pages for fill_gaps and the GapMask object to get a feeling for it and let me know if it feels right :)

This PR leaves all existing fill function API untouched. If we want to bring some of the new limit arguments to the existing functions (like DataArray.fillna(limit=...)), this could be done later in separate PRs.

TBD:

  • Update Whatsnew

@max-sixty
Copy link
Collaborator

Looks great, I think this is indeed very close to landing.

The one thing I'm stuck on is the concept of fill_direction="backward" ... .ffill... — can we explain what the purpose of this is? My mental model is that we're filling missing data from an edge in a single direction; I don't understand how what's going on to fill backwards in a forward fill.

(Totally fine if others get this, I'm not the gatekeeper here, but do think it's important the API makes sense...)

Co-authored-by: Maximilian Roos <[email protected]>
@Ockenfuss
Copy link
Contributor Author

Ockenfuss commented Jan 23, 2025

See this comment for an explanation, or just try out what happens :) For example:

da=xr.DataArray([1, np.nan, np.nan,  2], dims=['x'])
da.fill_gaps('x', limit=1, limit_direction='backward').ffill()
<xarray.DataArray (x: 4)> Size: 32B
array([ 1., nan,  1.,  2.])
Dimensions without coordinates: x

I agree, this is maybe not obvious. Here, I explained why I left it as is. If you want, I can try to keep track of the limit_direction argument upon mask creation and raise an exception, if the user applies ffill later. But this requires some tweaks in the code.

Edit: I added a sentence to the fill_gaps doc page to explain the behaviour better.

@max-sixty
Copy link
Collaborator

Yes, I see what it does, but (very very respectfully, with much admiration for the overall PR!), it doesn't make semantic sense to me.

When I think "backward", I mean "look behind you, is there a gap? OK now fill it". I don't see how it can mean "start one step back, and then decide which direction to continue". Why only one step?

I think we should try hard to have xarray's interfaces be understandable without much technical knowledge. (Indeed, I think there's probably a more precise way of describing my objection around edges and continuous vs. discrete, but hopefully this suffices).

Would you object to removing the limit_direction param, and deferring that to the method?

@Ockenfuss
Copy link
Contributor Author

Thank you very much for the feedback and efforts! I am not entirely sure what you mean with decide and continuous vs discrete, I have to admit. I drew you a little picture, to visualize how I see those things :)
1000039597

But I think, I can rework the API. Just to clarify before I start, what you aim for is the following?

da.fill_gaps('x', limit=1).ffill() # no direction argument anymore in the first function

In the case of interpolate_na and fillna, which work in all directions:

da.fill_gaps('x', limit=1).interpolate_na(limit_direction='forward')

@max-sixty
Copy link
Collaborator

Ha, thanks a lot for the drawing! I appreciate it!

FWIW the step there that didn't make sense to me is "one step backward" for bfill & ffill — my mental model is that filling doesn't require this, and if "one step backward" isn't in others' mental models, they'll also be confused by it...

interpolate_na is an interesting one, since you're correct that can go both ways:

  • Having a kwarg on interpolate_na is reasonable.
  • I could imagine a builder pattern where we just keep running methods .interpolate(...).with_direction(...), but that's not really pythonic and is inconsistent with xarray's API
  • or have limit_direction as it is now, but default to the fill direction and raise an error if it's not the same

Raise error if this is not the case.
This implies that the mask creation needs to be delayed until the filling stage.
@Ockenfuss
Copy link
Contributor Author

I went for option 3: Default to 'forward' in case of ffill and 'backward' in case of bfill. If this is not the case, an error is raised. (This is similar to what pandas is doing, check out the docstring here.)
I adapted the code and docs and added some tests to check this behaviour. What do you think?

@keewis
Copy link
Collaborator

keewis commented Jan 24, 2025

I'm late to the discussion (thanks for sticking with us for so long, even though we've been really, really slow!), but to me ffill with limit_direction="backwards" seems to mean "take the value before the gap, but start filling (and counting) from the back" (and the inverse for bfill: "take the values from after the gap and start filling from the front of the gap"). Which I also find odd, because I can't imagine a use case for that (maybe there is one I just can't see?). We can't use pandas as a precedent; that only has limit_area which is a different thing. I do get the value of limit_direction for interpolate, though. So the question I think is: are we confident that this will end up being useful? Or should this rather be postponed until we have a clear use case?

Other than that, the only suggestion I have is to look at the naming of the API (I guess I'm bikeshedding?):

ds.fill_gaps(...).interpolate(...)  # no need to call it `interpolate_na` since it already is in a namespace?

kinda works, but

ds.fill_gaps(...).ffill(...)
ds.fill_gaps(...).bfill(...)

repeats "fill" and feels a bit awkward (it does fit with the limit* kwargs, though). What about just gaps or, if we don't want to assume everyone understands gaps the same way, na? That would make the API:

ds.gaps(...).ffill(...)
ds.gaps(...).interpolate(...)

or

ds.na(...).ffill(...)
ds.na(...).interpolate(...)

The latter may be a bit short and missing the semantic hint of "we're working on gaps", though. (And either way, these are suggestions, not requests, so feel free to reject them if it doesn't make too much sense or if it has already been discussed in a previous version of this PR).

@Ockenfuss
Copy link
Contributor Author

Thanks you, no worries, all feedback is welcome everytime!!

ffill+direction backward issue

Your understanding is correct. I have no immediate use case for it, but I thought initially why forbidding something that is possible. However, Max also thought this case may be confusing, so we decided to raise an error if attempting it now.

Names

I am open to anything! I am with you that having "gap" somehow in the name helps for the understanding. Just "gaps" is also fine for me.

Anything else

You might wonder: Why a new fill_gaps API function and not just equip the existing functions with new arguments (e.g. for ffill: limit, limit_area, max_gap, limit_use_coordinate)?
Originally, we were mostly concerened about interpolate_na. Stephan and Deepak recommended a new API, since the existing interpolate_na has a very specific legacy behaviour. So for interpolate_na, we need a new API if we want to avoid deprecations. For ffill/bfill, both solutions have something in favour: The current implementation keeps all gap options nicely in one place. Directly adding the arguments instead would allow for some more freedom, e.g. the limit_direction argument could simply be omitted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants