
Protoextrapolate #1112

Open · wants to merge 14 commits into base: main
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -31,6 +31,7 @@ export(run_xml_tests)
export(standardize_iso)
export(unprotect_integer_cols)
importFrom(assertthat,assert_that)
importFrom(assertthat,is.scalar)
importFrom(data.table,data.table)
importFrom(dplyr,anti_join)
importFrom(dplyr,arrange)
@@ -92,3 +93,4 @@ importFrom(tidyr,separate)
importFrom(tidyr,spread)
importFrom(tidyr,unite)
importFrom(utils,capture.output)
importFrom(utils,tail)
5 changes: 5 additions & 0 deletions R/constants.R
@@ -33,6 +33,11 @@ MODEL_BASE_YEARS <- c(1975, 1990, 2005, 2010)
MODEL_FUTURE_YEARS <- seq(2015, 2100, 5)
MODEL_YEARS <- c(MODEL_BASE_YEARS, MODEL_FUTURE_YEARS)

# Intermediate BYU year constant. Needed during early development, when not all
# data sets get extended. The goal is to handle this automatically via a change
# to HISTORICAL_YEARS. Once the BYU skeleton is in place for all chunks, it
# should be possible to simply find and replace BYU_YEAR with max(HISTORICAL_YEARS).
BYU_YEAR <- 2015

# GCAM constants ======================================================================

83 changes: 83 additions & 0 deletions R/extrapolators.R
@@ -0,0 +1,83 @@
# extrapolators.R
# helper extrapolator functions for use in base year changes

#' extrapolate_constant
#'
#' Computes the mean of the last n original-year values (with
#' \code{mean(., na.rm = TRUE)}) and uses this constant value to fill in the NA
#' values corresponding to the extrapolation years at the end of the time series.
#' NOTE that this extrapolator does not touch any of the original data. It ONLY
#' fills in values corresponding to the extrapolation years. It is the user's
#' responsibility to account for this behavior when preparing raw data to be
#' extrapolated.
#'
#' @param x Vector of values with NA's corresponding to the extrapolation years,
#' to be filled in via constant extrapolation of the mean of the last n
#' original-year values.
#' @param n Number of final original-year values to average to produce the
#' fill value. Averaging is done with \code{na.rm = TRUE}.
#' Defaults to \code{n = 1}: the last recorded year's value is used as the
#' constant fill for the tail of missing values corresponding to the
#' extrapolation years.
#' @param numExtrapYrs The number of NA values at the tail end of each vector
#' that correspond to the extrapolation years and will be filled in. Must be an
#' integer; this will always be known for each data set in each chunk.
#' @details Computes the mean of the last n original-year values of input
#' vector x and uses this constant value to fill in the NA values in x that
#' correspond to the added extrapolation years.
#' @return Vector with the trailing extrapolation-year NA values replaced by
#' the specified mean.
#' @importFrom assertthat assert_that is.scalar
#' @importFrom utils tail
#' @author ACS June 2019
extrapolate_constant <- function(x, n = 1, numExtrapYrs) {

# Some assertion tests to make sure working on right data types
assert_that(is.numeric(x))
assert_that(is.scalar(n))
assert_that(is.integer(numExtrapYrs))


# The constant value to fill in all extrapolation year NA's with.
# = mean(. , na.rm = TRUE) of the last n values in the original
# data.
index_last_n_orig_yrs <- (length(x) - numExtrapYrs - n + 1):(length(x) - numExtrapYrs)
meanval <- mean(x[index_last_n_orig_yrs], na.rm = TRUE)


# fill in only the tail end, extrapolation years
# NA values with this constant.
index_extrap_yrs <- (length(x) - numExtrapYrs + 1):length(x)
x[index_extrap_yrs] <- meanval

return(x)
}
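
For illustration, a minimal usage sketch (hypothetical values, not part of the diff):

# The last two recorded values are 2 and 3; their mean (2.5) fills the two
# trailing extrapolation-year NAs. Note that numExtrapYrs must be passed as
# an integer (2L) to satisfy the is.integer() assertion.
extrapolate_constant(c(1, 2, 3, NA, NA), n = 2, numExtrapYrs = 2L)
#> [1] 1.0 2.0 3.0 2.5 2.5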



#' last_n_nonNA
#'
#' Finds the last n non-NA values in an input vector.
#' A convenience function for users who wish to customize
#' their extrapolations beyond the default, or who wish to
#' identify NA values in their original (unextrapolated)
#' data.
#'
#' @param x Vector with some NA values.
#' @param n The number of non-NA values sought.
#' @details Finds the last n non-NA values in an input vector.
#' @return A vector with the last n non-NA values from input
#' vector x.
#' @importFrom assertthat assert_that is.scalar
#' @author ACS June 2019
last_n_nonNA <- function(x, n) {

assert_that(is.scalar(n))

if(n > length(x[!is.na(x)])){
stop('asking for more nonNA years than you have.')
Member commented:

😆
}


return(tail(x[!is.na(x)], n))
}
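
For illustration, a quick hypothetical example:

# Returns the last two non-NA entries, skipping interleaved and trailing NAs.
last_n_nonNA(c(1, NA, 3, NA, 5, NA), n = 2)
#> [1] 3 5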


39 changes: 36 additions & 3 deletions R/zchunk_L100.GDP_hist.R
@@ -17,7 +17,7 @@
#' @importFrom assertthat assert_that
#' @importFrom tibble tibble
#' @importFrom dplyr filter mutate select
#' @importFrom tidyr gather spread
#' @importFrom tidyr gather spread complete
#' @author BBL February 2017
module_socioeconomics_L100.GDP_hist <- function(command, ...) {
if(command == driver.DECLARE_INPUTS) {
@@ -34,13 +34,46 @@ module_socioeconomics_L100.GDP_hist <- function(command, ...) {
usda_gdp_mer <- get_data(all_data, "socioeconomics/USDA_GDP_MER")
assert_that(tibble::is.tibble(usda_gdp_mer))

# Convert to long form, filter to historical years, convert units
# Convert to long form, convert units
usda_gdp_mer %>%
select(-Country) %>%
gather_years %>%
filter(!is.na(value), !is.na(iso)) %>%
mutate(value = value * CONV_BIL_MIL * gdp_deflator(1990, base_year = 2010),
year = as.integer(year)) %>%
year = as.integer(year)) ->
long_iso_year_gdp


# Perform BYU
# Skeleton - constant extrapolation
# BYUcompliant
#
# BYU NOTE - must make sure the units are all 1990 USD (or at least the
# same year basis) so that averaging behaves. Fine here, but worth noting
# across other chunks where money comes up, especially since we drop
# unit information pretty early on in most chunks.
# BYU NOTE - need to think about labeling that the output has had the BYU
# update done, and adding a note about the method?
if(max(long_iso_year_gdp$year) < BYU_YEAR){

extrapyears <- (max(long_iso_year_gdp$year) + 1):BYU_YEAR

# Constant extrapolation operates on a plain numeric vector, so it can be
# applied to any numeric column, regardless of grouping or column name.
# This produces what gets output in this chunk.
long_iso_year_gdp %>%
complete(year = c(year, extrapyears), iso) %>%
group_by(iso) %>%
mutate(value = extrapolate_constant(value, n = 1,
numExtrapYrs = length(extrapyears))) %>%
ungroup ->
long_iso_year_gdp
}
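
For illustration, a hypothetical toy example of this complete()/group_by() pattern (assumes dplyr, tidyr, and the extrapolators above are loaded; the data are made up):

# Two isos recorded through 2013, extended to 2015 by constant extrapolation.
toy <- tibble::tibble(iso = rep(c("usa", "can"), each = 3),
                      year = rep(2011:2013, 2),
                      value = c(10, 11, 12, 5, 6, 7))
extrapyears <- 2014:2015
toy %>%
  complete(year = c(year, extrapyears), iso) %>%
  group_by(iso) %>%
  mutate(value = extrapolate_constant(value, n = 1,
                                      numExtrapYrs = length(extrapyears))) %>%
  ungroup()
# Each iso's 2014 and 2015 rows are filled with its 2013 value.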


# filter to historical years, convert units
long_iso_year_gdp %>%
add_title("Historical GDP downscaled to country (iso)") %>%
add_comments("Units converted to constant 1990 USD") %>%
add_precursors("socioeconomics/USDA_GDP_MER") %>%
42 changes: 42 additions & 0 deletions man/extrapolate_constant.Rd


30 changes: 30 additions & 0 deletions man/last_n_nonNA.Rd


12 changes: 12 additions & 0 deletions tests/testthat/test_oldnew.R
@@ -89,6 +89,18 @@ test_that("matches old data system output", {
olddata <- COMPDATA[[oldf]]
expect_is(olddata, "data.frame", info = paste("No comparison data found for", oldf))

# During the base year update development process, some outputs from the data
# system will be extended and therefore have different dimensions from the
# old comparison data. This causes the old-new test to fail.
# The extended extrapolation years will be dropped from newdata so that the
# comparison can be made to the old data. This also serves to check
# that the extrapolation procedure does not touch the original data.
if(max(newdata$year) == BYU_YEAR){ # one way to check that it's a BYU without flags.
newdata %>%
filter(year <= max(HISTORICAL_YEARS)) ->
Contributor Author commented:
I think the if will do the trick, but this filter is not the right way to handle the correction to get the dimensions to agree. So Travis is still failing.

The problem is that the last year of data in an input file may not actually be max(HISTORICAL_YEARS). For the GDP file, the last recorded year is 2013, not 2010. From the Travis failures, it looks like there are probably 4 more files whose last recorded year happens to be 2015 (BYU_YEAR) rather than 2010, and so they fall into this if statement as well. There could be even more that have a last recorded year like 2012 or 2013 but aren't yet BYU'd to 2015.

I cannot think of an endogenous way to detect what the last recorded year was before extrapolation without adding additional information. I think I have an idea to incorporate it into a new FLAG, with some if statements in each BYU'd zchunk to handle it. That should be fine, but it is also an extra place for the teams to make a mistake, even if we present careful notes and examples. Obviously I'd put some thought into minimizing that.

Alternatively, if we don't think this part of the test will be kept beyond the actual base year updating process, we could also do

olddata %>%
  filter(year <= max(HISTORICAL_YEARS)) ->
  olddata

We will lose a few years of data from the oldnew test, but there is less room for user error.

So, do we want the flag, or the slightly less robust oldnew test?

Contributor commented:
There were only 10 or so that failed. Could we just flag them as FLAG_NO_TEST for now?
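
For illustration, a sketch of that flagging approach (assumes gcamdata's add_flags() helper and FLAG_NO_TEST constant apply here; the output name is hypothetical):

# At the end of an affected chunk's output pipeline, mark the extended
# output so the old-new comparison test skips it for now.
long_iso_year_gdp %>%
  add_flags(FLAG_NO_TEST) ->
  chunk_output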

Contributor commented:
One thing to keep in mind, the FAO Aquastat data are typically written out in (at best) 5-year intervals, with the data falling in years like 2008, 2012, 2017. Often there will be 20-year lapses in a time series. So, at least for that specific data source we don't want to be filtering years prior to filling in the missing data.

Member commented:
@pkyle Thanks! But here we're talking about the test code, not filtering before filling data.

@abigailsnyder @pralitp I'm also leery of adding extra logic/steps the user is responsible for. Another possible way to address this: weaken the oldnew test so it filters newdata to only the years present in olddata before doing the comparison. This would trim off any extrapolated data; the cost is that now it won't detect bad newdata that have extra years for non-BYU reasons.

newdata
}
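
For reference, a minimal sketch of the weakened-test alternative described above (hypothetical, not part of this diff):

# Restrict newdata to the years present in olddata, trimming any
# extrapolated years before the old-new comparison.
newdata %>%
  filter(year %in% unique(olddata$year)) ->
  newdata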

# Finally, test (NB rounding numeric columns to a sensible number of
# digits; otherwise spurious mismatches occur)
# Also first converts integer columns to numeric (otherwise test will