
Protoextrapolate #1112

Open · wants to merge 14 commits into main
Conversation

abigailsnyder (Contributor)

This branch contains two different implementations of extrapolate_constant - one that works on vectors, and one that works on grouped data frames that have a column named value.
Both are implemented in the test chunk zchunk_L100.GDP_hist.R.

I absolutely expect there will be iterations among us, but this PR is at least relatively short. Reviewers can take some time looking through it while I start on the implementations of extrapolate_linear and extrapolate_correlated.
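For orientation, here is roughly the shape of the two options (an illustrative sketch, not the exact code on the branch; in particular the fill rule shown is just one plausible choice):

library(dplyr)

# Option A: plain numeric vector in, vector out, so it can be used inside
# mutate() with any column name and any grouping. Illustrative rule: fill
# NAs with the mean of the last n non-NA values.
extrapolate_constant_vec <- function(x, n = 1) {
  stopifnot(n <= sum(!is.na(x)))  # can't average more non-NA years than we have
  fill <- mean(tail(x[!is.na(x)], n))
  x[is.na(x)] <- fill
  x
}

# Option B: grouped data frame in, with the column hard-wired to value.
extrapolate_constant_df <- function(d, n = 1) {
  d %>% mutate(value = extrapolate_constant_vec(value, n = n))
}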

@abigailsnyder abigailsnyder requested review from bpbond and pralitp June 19, 2019 18:31
@bpbond (Member) left a comment


Thanks @abigailsnyder! I want to think about the R/zchunk_L100.GDP_hist.R code.

Resolved review threads: R/constants.R; R/extrapolators.R (3 threads, outdated)
@pralitp (Contributor) left a comment

I'd lean towards option A, which just works on a vector, though for a fairly minor reason: I don't like forcing users to have a value column, which option B assumes. (We could make the column an argument, but I'm not too handy with the NSE stuff myself.)

However, thinking further, we will have:

repeat_add_columns(missingyears) %>%
  bind_rows(long_iso_year_gdp, .) %>%
  group_by(...) %>%
  extrapolate_ZZZ(n = 1) %>%

quite a lot. Would it make sense to combine the expand functionality into the extrapolate function so it is all done in a single step?

@abigailsnyder (Contributor, Author) commented Jul 18, 2019

@pralitp Returning to this now - I too am leaning toward option A (the vector option). In some ways I think it's more flexible, and I have also never gotten the NSE stuff to work for passing the column name as an argument (for the grouped tibble version B).

In terms of shortening the pipeline leading to the call in each chunk, the only real shortening I can see (for either option) is rolling the

repeat_add_columns(missingyears)

step into the extrapolator function. I am working on that now. It's a little tricky to think through in the vector form. Do you have an alternative suggestion?
Also, I don't think we want the extrapolator functions to actually calculate which years are missing. I think part of our coding ethos for this package is keeping things sort of mono-use for user clarity.

@abigailsnyder (Contributor, Author)

@pralitp update: I just realized rolling the

repeat_add_columns(missingyears)

step into the vector form won't work; it will result in the wrong number of rows being returned to the data frame.

If we want to roll that step into the extrapolate function(s), then we will have to go with the grouped tibble version B and either figure out the NSE stuff to have flexible column names, or just force a column name of value. I also don't like the latter option, but I am more ok than usual with it because these calls should be happening early in the pipelines when the tibbles are more likely to have a column named value anyway.

Alternative proposal (what I'm working on now):
I'll add an extra helper function that identifies the missing years and does the repeat_add_columns step. That separates out the year bookkeeping so we can still use the vector version A of extrapolation, while also cleaning up the pipelines in each chunk a little.
Let me know what you think!
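Roughly the shape I have in mind (an illustrative sketch - the helper name and arguments are placeholders, not code on the branch):

library(dplyr)
library(tidyr)

# Work out which years are missing between the last observed year and
# final_year, build one row per id/year combination, and bind them on with
# NA values so the vector extrapolator can fill them in afterwards.
add_missing_years <- function(d, id_cols, final_year) {
  missing_years <- (max(d$year) + 1):final_year
  new_rows <- crossing(distinct(d[id_cols]), year = missing_years)
  bind_rows(d, new_rows)  # value (and any other non-id column) is NA in new_rows
}

# A chunk pipeline would then collapse to roughly:
# long_iso_year_gdp %>%
#   add_missing_years(id_cols = "iso", final_year = BYU_YEAR) %>%
#   group_by(iso) %>%
#   mutate(value = extrapolate_constant(value, n = 1))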

@pralitp (Contributor) commented Jul 18, 2019

@abigailsnyder I was taking another look, and maybe the issue is that repeat_add_columns isn't the right tool for the job because the data already has a year column. Could it be done with just a complete? Then the group_by would still be required, but that isn't so bad.

@abigailsnyder (Contributor, Author)

You are correct that it can be done with complete; the call will just involve information specific to the data frame being operated on. I'm fine with that - I've basically been trying to do the same thing in a more general helper function, and that is turning into a nightmare because complete won't take column names as strings.
If we are OK having a call to complete for each data frame in each chunk, that's definitely easier.

@abigailsnyder (Contributor, Author)

So it will look like this in this chunk:

missingyears <- (max(long_iso_year_gdp$year) + 1):BYU_YEAR

# Constant extrapolation operating only on a numeric vector.
# It can operate on any numeric vector, regardless of grouping or column name.
# This is what gets output in this chunk:
long_iso_year_gdp %>%
  complete(year = c(year, missingyears), iso) %>%
  group_by(iso) %>%
  mutate(value = extrapolate_constant(value, n = 1))

This is definitely cleaner, but that call to complete will vary with each data set we deal with.

@pralitp (Contributor) commented Jul 18, 2019

Yeah, exactly the same argument as for the group_by. Not totally ideal, but I am willing to live with it.

@pralitp (Contributor) commented Jul 18, 2019

"won't take column names as strings."

Hmm, maybe this is where the !!! comes into play.

@abigailsnyder (Contributor, Author)

Hm, OK - there are some other changes being made for this PR too, so I'll make all of those with the ugly, in-chunk complete, and then play around with using !!! in a more general function after that.
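For reference, the tidy-eval route might look something like this (a sketch only - I have not actually tested or wired this in yet):

library(tidyr)
library(rlang)

# Accept the id columns as strings, convert them to symbols, and splice them
# into complete() with !!!. The year expression mirrors the chunk code above.
complete_to_year <- function(d, id_cols, final_year) {
  missing_years <- (max(d$year) + 1):final_year
  complete(d, !!!syms(id_cols), year = c(year, missing_years))
}

# e.g. complete_to_year(long_iso_year_gdp, id_cols = "iso", final_year = BYU_YEAR)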

@pkyle (Contributor) commented Jul 18, 2019

One thing to look out for: I had a group_by() %>% complete() step in a data processing pipeline a few years ago and it took forever. So, if you do end up doing that, I'd recommend testing it out on a big dataset.

@abigailsnyder (Contributor, Author)

@pkyle Good point, but this usage should be complete first (adding the missing years for every relevant ID - just iso here) and then group_by(ID). I think that ordering should be pretty consistent across the BYU, but I'll keep your note in mind in case we run into any chunks that have your ordering.
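i.e. the plan is, in sketch form (using the objects from the chunk code above):

# complete() runs once on the ungrouped table; only the fill step is grouped.
long_iso_year_gdp %>%
  complete(year = c(year, missingyears), iso) %>%
  group_by(iso) %>%
  mutate(value = extrapolate_constant(value, n = 1))

# rather than the pattern being warned about, where complete() runs per group:
# long_iso_year_gdp %>%
#   group_by(iso) %>%
#   complete(year = c(year, missingyears))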

@abigailsnyder (Contributor, Author)

@pralitp @bpbond Ready for re-review, with comments addressed. The package check passed locally right before I updated from master, so presumably that should still be fine.

Resolved review thread: R/zchunk_L100.GDP_hist.R (outdated)
@pralitp (Contributor) commented Aug 8, 2019

I've confused myself as to where we are in this discussion. Did we settle on just filling tail NA values with the average of the last n, and being cool with that even if the number of values filled might differ by group (the "should we fill missing 2010" problem)?

@abigailsnyder (Contributor, Author) commented Aug 8, 2019

I too am confused about where we have ended up.
If we handle the missing-2010 values by filling tail NA values with the average of the last n, and just being cool with that even if the number of values filled differs by group, that's definitely more annoying to code but doable.
I think it will be easiest for us to walk through options in person - I set a meeting for next week and I'll have some concrete examples of options prepared.
@bpbond is out of town all next week, so unless there is an objection, @pralitp and I will just make a decision (though Ben is still included on the meeting invite).

- Doesn't touch original data at all.
- The fill-in value now comes from the average of the last n original data years, with na.rm = TRUE.
- Update documentation to clarify.
- Rename one of the input arguments for clarity.
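One reading of the revised rule described in that commit message (an illustrative sketch, not the code on the branch): only the trailing run of NAs is filled, and everything before it, including any interior NAs in the original data, is left untouched.

# Fill the trailing NAs with the mean of the last n original data years.
extrapolate_constant <- function(x, n = 1) {
  last_obs <- max(which(!is.na(x)))        # last original (non-NA) entry
  if (last_obs == length(x)) return(x)     # nothing to extrapolate
  fill <- mean(tail(x[seq_len(last_obs)], n), na.rm = TRUE)
  x[(last_obs + 1):length(x)] <- fill
  x
}

# Note: if a group's original data also ended in NAs (the "missing 2010" case
# discussed above), those trailing NAs get filled too, so the number of filled
# values can differ by group.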
@abigailsnyder abigailsnyder requested a review from pralitp August 12, 2019 19:47
@abigailsnyder (Contributor, Author)

@pralitp @bpbond updates made to the constant extrapolation method + the old-new test as discussed in person Aug 9, 2019. Just FYI

# that the extrapolation procedure does not touch original data.
if(max(newdata$year) == BYU_YEAR) { # one way to check that it's a BYU without flags
  newdata %>%
    filter(year <= max(HISTORICAL_YEARS)) ->
abigailsnyder (Contributor, Author) commented:

I think the if will do the trick, but this filter is not the right way to handle the correction to get the dimensions to agree, so Travis is still failing.

The problem is that the last year of data in an input file may not actually be max(HISTORICAL_YEARS). For the GDP file, the last recorded year is 2013, not 2010. From the Travis failures, it looks like there are probably 4 more files that happen to have a last recorded year of 2015 (BYU_YEAR) rather than 2010 and so are falling into this if statement as well. There could be even more that have a last recorded year like 2012 or 2013 but aren't yet BYU'd to 2015.

I cannot think of an endogenous way to detect what the last recorded year was before extrapolation without adding additional information. I think I have an idea to incorporate it into a new FLAG, with some if statements in each zchunk that is BYU'd to handle it. That should be fine, but it is also an extra place for the teams to make a mistake, even if we present careful notes and examples. Obviously I'd put some thought into minimizing that.

Alternatively, if we don't think this part of the test will be kept beyond the actual base-year updating process, we could also do

olddata %>%
  filter(year <= max(HISTORICAL_YEARS)) ->
  olddata

We would lose a few years of data from the oldnew test, but there's less room for user error.

So, do we want to go with the flag or the slightly less robust oldnew test?

Contributor:

There were only 10 or so that failed. Could we just flag them as FLAG_NO_TEST for now?

pkyle (Contributor):

One thing to keep in mind: the FAO Aquastat data are typically written out in (at best) 5-year intervals, with the data falling in years like 2008, 2012, 2017. Often there will be 20-year gaps in a time series. So, at least for that specific data source, we don't want to be filtering years prior to filling in the missing data.

bpbond (Member):

@pkyle Thanks! But here we're talking about the test code, not filtering before filling data.

@abigailsnyder @pralitp I'm also leery of adding extra logic/steps the user is responsible for. Another possible way to address this: weaken the oldnew test so it filters newdata to only the years present in olddata before doing the comparison. This would trim off any extrapolated data; the cost is that now it won't detect bad newdata that have extra years for non-BYU reasons.
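Concretely, something like this (a sketch, not the actual oldnew test code):

library(dplyr)

# Drop from newdata any years that olddata does not have, then run the usual
# oldnew comparison on the trimmed copy. Extrapolated BYU years disappear from
# the comparison - and so would any spurious extra years, which is the cost
# noted above.
newdata_trimmed <- filter(newdata, year %in% unique(olddata$year))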

assert_that(is.scalar(n))

if(n > length(x[!is.na(x)])) {
  stop('asking for more nonNA years than you have.')
bpbond (Member):

😆

Base automatically changed from master to main January 19, 2021 20:24
@russellhz (Collaborator)

@abigailsnyder where does the PR stand? I don't follow everything but it looks like this was primarily for the BYU? @ssmithClimate I think you are just extrapolating from the last available year of data? Would these functions be helpful to merge in?
