
Protoextrapolate #1112

Open · wants to merge 14 commits into main
Conversation

abigailsnyder (Contributor)

This branch contains two different implementations of extrapolate_constant - one that works on vectors, and one that works on grouped data frames that have a column named value.
Both are implemented in the test chunk zchunk_L100.GDP_hist.R.

I absolutely expect there will be iterations among us, but this PR is at least relatively short. Reviewers can take some time looking through it while I start on the implementations of extrapolate_linear and extrapolate_correlated.
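For orientation, here is roughly the shape of the two options (an illustrative sketch, not the exact code on the branch; in particular the fill rule shown is just one plausible choice):

library(dplyr)

# Option A: plain numeric vector in, vector out, so it can be used inside
# mutate() with any column name and any grouping. Illustrative rule: fill
# NAs with the mean of the last n non-NA values.
extrapolate_constant_vec <- function(x, n = 1) {
  stopifnot(n <= sum(!is.na(x)))  # can't average more non-NA years than we have
  fill <- mean(tail(x[!is.na(x)], n))
  x[is.na(x)] <- fill
  x
}

# Option B: grouped data frame in, with the column hard-wired to value.
extrapolate_constant_df <- function(d, n = 1) {
  d %>% mutate(value = extrapolate_constant_vec(value, n = n))
}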

@abigailsnyder abigailsnyder requested review from bpbond and pralitp June 19, 2019 18:31
@bpbond (Member) left a comment


Thanks @abigailsnyder! I want to think about the R/zchunk_L100.GDP_hist.R code.

Resolved review threads: R/constants.R; R/extrapolators.R (3 threads, outdated)
@pralitp (Contributor) left a comment

I'd lean towards option A, which just works on a vector, though for a fairly minor reason: I don't like forcing users to have a value column, which option B assumes. (We could make the column an argument, but I'm not too handy with the NSE stuff myself.)

However, thinking further, we will have:

repeat_add_columns(missingyears) %>%
  bind_rows(long_iso_year_gdp, .) %>%
  group_by(...) %>%
  extrapolate_ZZZ(n = 1) %>%

quite a lot. Would it make sense to combine the expand functionality into the extrapolate function so it is all done in a single step?

@abigailsnyder (Contributor, Author) commented Jul 18, 2019

@pralitp Returning to this now - I too am leaning toward option A (the vector option). In some ways I think it's more flexible, and I have also never gotten the NSE stuff to work for passing the column name as an argument (for the grouped tibble version B).

In terms of shortening the pipeline leading to the call in each chunk, the only real shortening I can see (for either option) is rolling the

repeat_add_columns(missingyears)

step into the extrapolator function. I am working on that now. It's a little tricky to think through in the vector form. Do you have an alternative suggestion?
Also, I don't think we want the extrapolator functions to actually calculate which years are missing. I think part of our coding ethos for this package is keeping things sort of mono-use for user clarity.

@abigailsnyder (Contributor, Author)

@pralitp update: I just realized rolling the

repeat_add_columns(missingyears)

step into the vector form won't work; it will result in the wrong number of rows being returned to the data frame.

If we want to roll that step into the extrapolate function(s), then we will have to go with the grouped tibble version B and either figure out the NSE stuff to have flexible column names, or just force a column name of value. I also don't like the latter option, but I am more ok than usual with it because these calls should be happening early in the pipelines when the tibbles are more likely to have a column named value anyway.

Alternative proposal (what I'm working on now):
I'll add an extra helper function that identifies the missing years and does the repeat_add_columns step. That separates out the year bookkeeping so we can still use the vector version A of extrapolation, while also cleaning up the pipelines in each chunk a little.
Let me know what you think!
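Roughly the shape I have in mind (an illustrative sketch - the helper name and arguments are placeholders, not code on the branch):

library(dplyr)
library(tidyr)

# Work out which years are missing between the last observed year and
# final_year, build one row per id/year combination, and bind them on with
# NA values so the vector extrapolator can fill them in afterwards.
add_missing_years <- function(d, id_cols, final_year) {
  missing_years <- (max(d$year) + 1):final_year
  new_rows <- crossing(distinct(d[id_cols]), year = missing_years)
  bind_rows(d, new_rows)  # value (and any other non-id column) is NA in new_rows
}

# A chunk pipeline would then collapse to roughly:
# long_iso_year_gdp %>%
#   add_missing_years(id_cols = "iso", final_year = BYU_YEAR) %>%
#   group_by(iso) %>%
#   mutate(value = extrapolate_constant(value, n = 1))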

@pralitp (Contributor) commented Jul 18, 2019

@abigailsnyder I was taking another look, and maybe the issue is that repeat_add_columns isn't the right tool for the job because the data already has a year column. Could it be done with just a complete? Then the group_by would still be required, but that isn't so bad.

@abigailsnyder (Contributor, Author)

You are correct that it can be done with complete; the call will just involve information specific to the data frame being operated on. I'm fine with that - I've basically been trying to do the same thing in a more general helper function, and that is turning into a nightmare because complete won't take column names as strings.
If we are OK having a call to complete for each data frame in each chunk, that's definitely easier.

@abigailsnyder (Contributor, Author)

So it will look like this in this chunk:

missingyears <- (max(long_iso_year_gdp$year) + 1):BYU_YEAR

# Constant extrapolation operating only on a numeric vector.
# It can operate on any numeric vector, regardless of grouping or column name.
# This is what gets output in this chunk:
long_iso_year_gdp %>%
  complete(year = c(year, missingyears), iso) %>%
  group_by(iso) %>%
  mutate(value = extrapolate_constant(value, n = 1))

This is definitely cleaner, but that call to complete will vary with each data set we deal with.

@pralitp (Contributor) commented Jul 18, 2019

Yeah, exactly the same argument as for the group_by. Not totally ideal, but I am willing to live with it.

@pralitp (Contributor) commented Jul 18, 2019

"won't take column names as strings."

Hmm, maybe this is where the !!! comes into play.

@abigailsnyder (Contributor, Author)

Hm, OK - there are some other changes being made for this PR too, so I'll make all of those with the ugly, in-chunk complete, and then play around with using !!! in a more general function after that.
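For reference, the tidy-eval route might look something like this (a sketch only - I have not actually tested or wired this in yet):

library(tidyr)
library(rlang)

# Accept the id columns as strings, convert them to symbols, and splice them
# into complete() with !!!. The year expression mirrors the chunk code above.
complete_to_year <- function(d, id_cols, final_year) {
  missing_years <- (max(d$year) + 1):final_year
  complete(d, !!!syms(id_cols), year = c(year, missing_years))
}

# e.g. complete_to_year(long_iso_year_gdp, id_cols = "iso", final_year = BYU_YEAR)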

@pkyle (Contributor) commented Jul 18, 2019

One thing to look out for: I had a group_by() %>% complete() step in a data processing pipeline a few years ago and it took forever. So, if you do end up doing that, I'd recommend testing it out on a big dataset.

@abigailsnyder (Contributor, Author)

@pkyle Good point, but this usage should be complete first (adding the missing years for every relevant ID - just iso here) and then group_by(ID). I think that ordering should be pretty consistent across the BYU, but I'll keep your note in mind in case we run into any chunks that have your ordering.
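i.e. the plan is, in sketch form (using the objects from the chunk code above):

# complete() runs once on the ungrouped table; only the fill step is grouped.
long_iso_year_gdp %>%
  complete(year = c(year, missingyears), iso) %>%
  group_by(iso) %>%
  mutate(value = extrapolate_constant(value, n = 1))

# rather than the pattern being warned about, where complete() runs per group:
# long_iso_year_gdp %>%
#   group_by(iso) %>%
#   complete(year = c(year, missingyears))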

@abigailsnyder (Contributor, Author)

@pralitp @bpbond Ready for re-review, with comments addressed. The package check passed locally right before I updated from master, so presumably that should still be fine.

Resolved review thread: R/zchunk_L100.GDP_hist.R (outdated)
@pralitp (Contributor) commented Aug 8, 2019

I've confused myself as to where we are in this discussion. Did we settle on just filling tail NA values with the average of the last n, and being cool with that even if the number of values filled might differ by group (the "should we fill missing 2010" problem)?

@abigailsnyder (Contributor, Author) commented Aug 8, 2019

I too am confused about where we have ended up.
If we handle the missing-2010 values by filling tail NA values with the average of the last n, and just being cool with that even if the number of values filled differs by group, that's definitely more annoying to code but doable.
I think it will be easiest for us to walk through options in person - I set a meeting for next week and I'll have some concrete examples of options prepared.
@bpbond is out of town all next week, so unless there is an objection, @pralitp and I will just make a decision (though Ben is still included on the meeting invite).

- Doesn't touch original data at all.
- The fill-in value now comes from the average of the last n original data years, with na.rm = TRUE.
- Update documentation to clarify.
- Rename one of the input arguments for clarity.
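One reading of the revised rule described in that commit message (an illustrative sketch, not the code on the branch): only the trailing run of NAs is filled, and everything before it, including any interior NAs in the original data, is left untouched.

# Fill the trailing NAs with the mean of the last n original data years.
extrapolate_constant <- function(x, n = 1) {
  last_obs <- max(which(!is.na(x)))        # last original (non-NA) entry
  if (last_obs == length(x)) return(x)     # nothing to extrapolate
  fill <- mean(tail(x[seq_len(last_obs)], n), na.rm = TRUE)
  x[(last_obs + 1):length(x)] <- fill
  x
}

# Note: if a group's original data also ended in NAs (the "missing 2010" case
# discussed above), those trailing NAs get filled too, so the number of filled
# values can differ by group.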
@abigailsnyder abigailsnyder requested a review from pralitp August 12, 2019 19:47
@abigailsnyder (Contributor, Author)

@pralitp @bpbond updates made to the constant extrapolation method + the old-new test as discussed in person Aug 9, 2019. Just FYI

# that the extrapolation procedure does not touch original data.
if(max(newdata$year) == BYU_YEAR) { # one way to check that it's a BYU without flags
  newdata %>%
    filter(year <= max(HISTORICAL_YEARS)) ->
abigailsnyder (Contributor, Author) commented:

I think the if will do the trick, but this filter is not the right way to handle the correction to get the dimensions to agree, so Travis is still failing.

The problem is that the last year of data in an input file may not actually be max(HISTORICAL_YEARS). For the GDP file, the last recorded year is 2013, not 2010. From the Travis failures, it looks like there are probably 4 more files that happen to have a last recorded year of 2015 (BYU_YEAR) rather than 2010 and so are falling into this if statement as well. There could be even more that have a last recorded year like 2012 or 2013 but aren't yet BYU'd to 2015.

I cannot think of an endogenous way to detect what the last recorded year was before extrapolation without adding additional information. I think I have an idea to incorporate it into a new FLAG, with some if statements in each zchunk that is BYU'd to handle it. That should be fine, but it is also an extra place for the teams to make a mistake, even if we present careful notes and examples. Obviously I'd put some thought into minimizing that.

Alternatively, if we don't think this part of the test will be kept beyond the actual base-year updating process, we could also do

olddata %>%
  filter(year <= max(HISTORICAL_YEARS)) ->
  olddata

We would lose a few years of data from the oldnew test, but there's less room for user error.

So, do we want to go with the flag or the slightly less robust oldnew test?

Contributor:

There were only 10 or so that failed. Could we just flag them as FLAG_NO_TEST for now?

pkyle (Contributor):

One thing to keep in mind: the FAO Aquastat data are typically written out in (at best) 5-year intervals, with the data falling in years like 2008, 2012, 2017. Often there will be 20-year gaps in a time series. So, at least for that specific data source, we don't want to be filtering years prior to filling in the missing data.

bpbond (Member):

@pkyle Thanks! But here we're talking about the test code, not filtering before filling data.

@abigailsnyder @pralitp I'm also leery of adding extra logic/steps the user is responsible for. Another possible way to address this: weaken the oldnew test so it filters newdata to only the years present in olddata before doing the comparison. This would trim off any extrapolated data; the cost is that now it won't detect bad newdata that have extra years for non-BYU reasons.
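Concretely, something like this (a sketch, not the actual oldnew test code):

library(dplyr)

# Drop from newdata any years that olddata does not have, then run the usual
# oldnew comparison on the trimmed copy. Extrapolated BYU years disappear from
# the comparison - and so would any spurious extra years, which is the cost
# noted above.
newdata_trimmed <- filter(newdata, year %in% unique(olddata$year))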

assert_that(is.scalar(n))

if(n > length(x[!is.na(x)])) {
  stop('asking for more nonNA years than you have.')
bpbond (Member):

😆

Base automatically changed from master to main January 19, 2021 20:24
@russellhz (Collaborator)

@abigailsnyder where does the PR stand? I don't follow everything but it looks like this was primarily for the BYU? @ssmithClimate I think you are just extrapolating from the last available year of data? Would these functions be helpful to merge in?
