-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fill gaps limited 7665 #9402
base: main
Are you sure you want to change the base?
Fill gaps limited 7665 #9402
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems extremely good!!
I left some refine-y comments, nothing preclusive.
What do others think? I don't use interpolate_na
myself much so other may have views...
| datetime.timedelta | ||
) = None, | ||
limit_direction: LimitDirectionOptions = "both", | ||
limit_area: LimitAreaOptions | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refine-y but:
- are
interpolate
andextrapolate
clearer options? - is there a better name for the param? I'm not sure. "area" sounds 2D to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I adopted this nomenclature from pandas. What about limit_region
, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...
@@ -3766,6 +3769,160 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self: | |||
|
|||
return bfill(self, dim, limit=limit) | |||
|
|||
def fill_gaps( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts from others on the naming? Would .fill
be insufficiently specific that it's filling na
? Would fill_missing
be clearer than fill_gaps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to any of those. .fill
sounds very concise, but maybe this is easily confused with .ffill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think .fill
could be quite nice, do others have a view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe gap_filler
instead, since this method does not actually fill the gaps.
I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later
mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points! Just to explain:
gap_filler
emphasizes the returned object type nicely! However, I choose fill_gaps
because it fits the naming scheme of other object-returning functions better (e.g. rolling
and coarsen
are not called roller
and coarser
in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler
, I will change accordingly.
The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~
). If the mask is required, you can easily get it from the object:
mask=ds.gap_filler(...).mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically all the new methods are missing return types, please consider adding them.
The new class should be generic, I have added a few review comments here and there but more adoptions are needed.
Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...) |
@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors... |
I'm currently on holiday and it's difficult to debug these errors on the phone. |
There are a couple of outstanding questions about the interfaces; in particular:
Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those) What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...) |
One more naming decision:
|
Did you manage to discuss this this morning? |
@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here. |
Hi @max-sixty, not too much community feedback yet :) What do you think, shall we have another look on the three open naming issues and then push this over the finish line? |
Yes let's do it! I would bias for action; if folks disagree then dissent welcome. What's remaining? Personally I think we should simplify this: #9402 (comment) + resolve any names. The external types should be correct; internally some ignores are OK... |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
- Allows for more flexibility, e.g. optional dim arguments - Better static typing
Ok, I think from my side this is ready for merge. I stayed with the fill_gaps and limit_area names for now. Check out the doc build pages for fill_gaps and the GapMask object to get a feeling for it and let me know if it feels right :) This PR leaves all existing fill function API untouched. If we want to bring some of the new limit arguments to the existing functions (like DataArray.fillna(limit=...)), this could be done later in separate PRs. TBD:
|
Looks great, I think this is indeed very close to landing. The one thing I'm stuck on is the concept of (Totally fine if others get this, I'm not the gatekeeper here, but do think it's important the API makes sense...) |
Co-authored-by: Maximilian Roos <[email protected]>
See this comment for an explanation, or just try out what happens :) For example: da=xr.DataArray([1, np.nan, np.nan, 2], dims=['x'])
da.fill_gaps('x', limit=1, limit_direction='backward').ffill()
I agree, this is maybe not obvious. Here, I explained why I left it as is. If you want, I can try to keep track of the limit_direction argument upon mask creation and raise an exception, if the user applies ffill later. But this requires some tweaks in the code. Edit: I added a sentence to the fill_gaps doc page to explain the behaviour better. |
Yes, I see what it does, but (very very respectfully, with much admiration for the overall PR!), it doesn't make semantic sense to me. When I think "backward", I mean "look behind you, is there a gap? OK now fill it". I don't see how it can mean "start one step back, and then decide which direction to continue". Why only one step? I think we should try hard to have xarray's interfaces be understandable without much technical knowledge. (Indeed, I think there's probably a more precise way of describing my objection around edges and continuous vs. discrete, but hopefully this suffices). Would you object to removing the |
Ha, thanks a lot for the drawing! I appreciate it! FWIW the step there that didn't make sense to me is "one step backward" for
|
Raise error if this is not the case. This implies that the mask creation needs to be delayed until the filling stage.
I went for option 3: Default to 'forward' in case of ffill and 'backward' in case of bfill. If this is not the case, an error is raised. (This is similar to what pandas is doing, check out the docstring here.) |
I'm late to the discussion (thanks for sticking with us for so long, even though we've been really, really slow!), but to me Other than that, the only suggestion I have is to look at the naming of the API (I guess I'm bikeshedding?): ds.fill_gaps(...).interpolate(...) # no need to call it `interpolate_na` since it already is in a namespace? kinda works, but ds.fill_gaps(...).ffill(...)
ds.fill_gaps(...).bfill(...) repeats "fill" and feels a bit awkward (it does fit with the ds.gaps(...).ffill(...)
ds.gaps(...).interpolate(...) or ds.na(...).ffill(...)
ds.na(...).interpolate(...) The latter may be a bit short and missing the semantic hint of "we're working on gaps", though. (And either way, these are suggestions, not requests, so feel free to reject them if it doesn't make too much sense or if it has already been discussed in a previous version of this PR). |
Thanks you, no worries, all feedback is welcome everytime!! ffill+direction backward issueYour understanding is correct. I have no immediate use case for it, but I thought initially why forbidding something that is possible. However, Max also thought this case may be confusing, so we decided to raise an error if attempting it now. NamesI am open to anything! I am with you that having "gap" somehow in the name helps for the understanding. Just "gaps" is also fine for me. Anything elseYou might wonder: Why a new fill_gaps API function and not just equip the existing functions with new arguments (e.g. for ffill: limit, limit_area, max_gap, limit_use_coordinate)? |
Improve gap filling
fill_gaps()
to control gap length for all filling functions (ffill/bfill/interpolate_na/fillna)This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function
fill_gaps
was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.TBD if accepted:
whats-new.rst
api.rst
Deprecations
limit
in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.