Protoextrapolate #1112
base: main
Conversation
…ton into all chunks
Thanks @abigailsnyder! I want to think about the R/zchunk_L100.GDP_hist.R code.
I'd say I would lean towards option A, which just works on a vector, although for a fairly minor reason: I don't like forcing users to have a value column, which option B assumes. (We could make passing in the column an argument, but I'm not too handy with the NSE stuff myself.)
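For concreteness, a rough sketch of what option B with a flexible column name could look like using rlang's tidy-eval (curly-curly) syntax. This is illustrative only; extrapolate_constant_df is a hypothetical name, not code from this branch.

library(dplyr)

extrapolate_constant_df <- function(d, n = 1, col = value) {
  # Within each group of d, fill NA entries in `col` (in the BYU case,
  # the appended new years) with the mean of the last n non-NA values.
  d %>%
    mutate({{ col }} := if_else(is.na({{ col }}),
                                mean(tail(({{ col }})[!is.na({{ col }})], n)),
                                {{ col }}))
}

Usage would then be something like long_iso_year_gdp %>% group_by(iso) %>% extrapolate_constant_df(n = 1, col = value).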
However, thinking further, we will have:
repeat_add_columns(missingyears) %>%
bind_rows(long_iso_year_gdp, .) %>%
group_by(...) %>%
extrapolate_ZZZ(n = 1) %>%
in quite a lot of chunks. Would it make sense to combine the expand functionality into the extrapolate function so it is done in a single step?
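For illustration, if the expand step were folded into the extrapolator, the call in each chunk might collapse to something like the following (new_years is a hypothetical argument name, reusing missingyears from the pipeline above):

long_iso_year_gdp %>%
  group_by(...) %>%
  extrapolate_ZZZ(n = 1, new_years = missingyears)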
@pralitp returning to this now - I too am leaning toward option A (the vector option). In some ways I think it's more flexible, and I have also never gotten the NSE stuff to work to pass the column name as an argument (for the grouped-tibble version B). In terms of shortening the pipeline leading to the call in each chunk, the only real shortening I can see (for either option) is rolling the expand step into the extrapolator function. I am working on that now. It's a little tricky to think through in the vector form. Do you have an alternative suggestion?
@pralitp update: I just realized rolling the expand step into the vector form won't work; it will result in the wrong number of rows being returned to the data frame. If we want to roll that step into the extrapolate function(s), then we will have to go with the grouped-tibble version B and either figure out the NSE stuff to allow flexible column names, or just force a column name of value. Alternative proposal (what I'm working on now):
@abigailsnyder, as I was taking another look: maybe the deal is tidyr's complete?
You are correct that it can be done by complete.
So it will look like this in this chunk. Which is definitely cleaner, but that call to complete …
Yeah, exactly the same arguments as to the …
Hmm, maybe this is where the …
Hm, OK, there are some other changes being made for this PR too, so I'll make all of those with the ugly, in-chunk complete, and then play around with trying to use complete in a more general function a little more after that.
One thing to look out for: I had a group_by() %>% complete() step in a data processing pipeline a few years ago and it took forever. So if you do end up doing that, I'd recommend testing it out on a big dataset.
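For reference, the pattern in question would look roughly like this sketch (column names follow the pipeline above; complete() is tidyr's):

library(dplyr)
library(tidyr)

long_iso_year_gdp %>%
  group_by(iso) %>%
  complete(year = HISTORICAL_YEARS) %>%  # add any missing years as NA rows, per iso
  ungroup()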
@pkyle good point, but this usage should be …
I've confused myself as to where we are on this discussion. Did we settle on just filling tail …
I too am confused about where we have ended up.
- Doesn't touch original data at all.
- The fill-in mean value now comes from the average of the last n original data years, with na.rm = TRUE.
- Update documentation to clarify.
- Rename one of the input arguments for clarity.
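In other words, the fill-in constant is computed along these lines (a sketch; x stands for the original data vector ordered by year):

fill_value <- mean(tail(x, n), na.rm = TRUE)  # average of the last n original data years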
# that the extrapolation procedure does not touch original data.
if(max(newdata$year) == BYU_YEAR){ # one way to check that it's a BYU without flags.
  newdata %>%
    filter(year <= max(HISTORICAL_YEARS)) ->
I think the if will do the trick, but this filter is not the right way to handle the correction to get the dimensions to agree, so Travis is still failing.
The problem is that the last year of data in an input file may not actually be max(HISTORICAL_YEARS). For the GDP file, the last recorded year is 2013, not 2010. From the Travis failures, it looks like there are probably 4 more files that happen to have a last recorded year of 2015 (BYU_YEAR) rather than 2010, and so are falling into this if statement as well. There could be even more that have a last recorded year like 2012 or 2013 but aren't yet BYU'd to 2015.
I cannot think of an endogenous way to detect what the last recorded year was before extrapolation without adding additional information. I think I have an idea to incorporate it into a new FLAG, with some if statements in each BYU'd zchunk to handle it. That should be fine, but it's also an extra place for the teams to make a mistake, even if we present careful notes and examples. Obviously I'd put some thought into minimizing that.
Alternatively, if we don't think this part of the test will be kept beyond the actual base year updating process, we could also do:
olddata %>%
filter(year <= max(HISTORICAL_YEARS)) ->
olddata
We will lose a few years of data from the oldnew test, but there's less room for user error.
So, do we want to go with the flag or the slightly less robust oldnew test?
There were only 10 or so that failed. Could we just flag them as FLAG_NO_TEST for now?
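If we go that route, tagging an affected chunk's output might look like this sketch (it assumes the data system's existing flag mechanism, e.g. an add_flags helper, applies here):

L100.GDP_hist %>%
  add_flags(FLAG_NO_TEST) ->  # oldnew test skips this output for now
  L100.GDP_hist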
One thing to keep in mind: the FAO Aquastat data are typically written out in (at best) 5-year intervals, with the data falling in years like 2008, 2012, and 2017. Often there will be 20-year gaps in a time series. So, at least for that specific data source, we don't want to be filtering years prior to filling in the missing data.
@pkyle Thanks! But here we're talking about the test code, not filtering before filling data.
@abigailsnyder @pralitp I'm also leery of adding extra logic/steps the user is responsible for. Another possible way to address this: weaken the oldnew test so it filters newdata to only the years present in olddata before doing the comparison. This would trim off any extrapolated data; the cost is that it won't detect bad newdata that have extra years for non-BYU reasons.
assert_that(is.scalar(n))

if(n > length(x[!is.na(x)])){
  stop('asking for more nonNA years than you have.')
😆
@abigailsnyder, where does the PR stand? I don't follow everything, but it looks like this was primarily for the BYU? @ssmithClimate, I think you are just extrapolating from the last available year of data? Would these functions be helpful to merge in?
This branch contains two different implementations of extrapolate_constant: one that works on vectors, and one that works on grouped data frames that have a column named value. Both are implemented in the test chunk zchunk_L100.GDP_hist.R. I absolutely expect there will be iterations among us, but this PR is relatively short at least. Reviewers can take some time looking through it, and I'll start on the implementations of extrapolate_linear and extrapolate_correlated.
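For orientation, a rough sketch of what the vector form (option A) could look like, consistent with the argument checks and the mean-of-last-n-years rule discussed above. This is illustrative, not the branch's exact code:

library(assertthat)

extrapolate_constant <- function(x, n = 1) {
  # x: a vector ordered by year, with NAs in the years to be filled in
  assert_that(is.scalar(n))
  if(n > length(x[!is.na(x)])) {
    stop('asking for more nonNA years than you have.')
  }
  # Fill NA entries with the mean of the last n non-NA values (the
  # branch's version averages the last n original data years with
  # na.rm = TRUE, so original data are never modified).
  x[is.na(x)] <- mean(tail(x[!is.na(x)], n))
  x
}

Per-group use would then be along the lines of df %>% group_by(iso) %>% mutate(value = extrapolate_constant(value, n = 1)).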