bugfix on nearest neighbor selection
* also updates to vignette
* also updates to tests
ha0ye committed Dec 4, 2017
1 parent a2b18c4 commit 5a9af93
Showing 10 changed files with 102 additions and 20 deletions.
7 changes: 4 additions & 3 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: rEDM
Type: Package
Title: Applications of Empirical Dynamic Modeling from Time Series
Version: 0.6.8
Date: 2017-11-13
Version: 0.6.9
Date: 2017-12-04
Maintainer: Hao Ye <[email protected]>
Author: Hao Ye [aut, cre],
Adam Clark [aut],
@@ -14,6 +14,7 @@ Author: Hao Ye [aut, cre],
Jane Cowles [ctb],
James Stagge [ctb],
Yair Daon [ctb],
Andrew Edwards [ctb],
George Sugihara [ctb, ccp]
Description: A new implementation of EDM algorithms based on research software previously developed for internal use in the Sugihara Lab (UCSD/SIO). Contains C++ compiled objects that use time delay embedding to perform state-space reconstruction and nonlinear forecasting and an R interface to those objects using 'Rcpp'. It supports both the simplex projection method from Sugihara & May (1990) <DOI:10.1038/344734a0> and the S-map algorithm in Sugihara (1994) <DOI:10.1098/rsta.1994.0106>. In addition, this package implements convergent cross mapping as described in Sugihara et al. (2012) <DOI:10.1126/science.1227079> and multiview embedding as described in Ye & Sugihara (2016) <DOI:10.1126/science.aag0863>.
License: file LICENSE
@@ -23,6 +24,6 @@ Imports:
methods
LinkingTo: Rcpp, RcppEigen
RcppModules: lnlp_module, block_lnlp_module, ccm_module
Suggests: knitr, rmarkdown, R.rsp, ggplot2
Suggests: knitr, rmarkdown, R.rsp, ggplot2, testthat
VignetteBuilder: knitr, R.rsp
RoxygenNote: 6.0.1
6 changes: 4 additions & 2 deletions inst/tests/test_block_gp.R
@@ -10,11 +10,13 @@ output <- block_gp(block, columns = c("x", "y"),
v_e = seq(from = -300, to = 300, by = 50),
eta = 7,
fit_params = FALSE,
first_column_time = TRUE, stats_only = FALSE)
first_column_time = TRUE, stats_only = FALSE,
silent = TRUE)
plot(output$model_output[[4]]$obs,
output$model_output[[4]]$pred)

# multivariate simplex projection using x and y to predict x
output <- block_gp(block, columns = c("x", "y"),
first_column_time = TRUE, stats_only = FALSE,
save_covariance_matrix = TRUE)
save_covariance_matrix = TRUE,
silent = TRUE)
6 changes: 4 additions & 2 deletions inst/tests/test_block_lnlp.R
@@ -6,7 +6,8 @@ block <- two_species_model[1:200, ]

# multivariate simplex projection using x and y to predict x
output <- block_lnlp(block, columns = c("x", "y"),
first_column_time = TRUE, stats_only = FALSE)
first_column_time = TRUE, stats_only = FALSE,
silent = TRUE)

output <- output$model_output[[1]]
output$pred_err <- sqrt(output$pred_var)
@@ -20,4 +21,5 @@ for (i in t)
}

# cross mapping using x to predict y
block_lnlp(block, target_column = 2, columns = c("x"), first_column_time = TRUE)
block_lnlp(block, target_column = 2, columns = c("x"), first_column_time = TRUE,
silent = TRUE)
3 changes: 2 additions & 1 deletion inst/tests/test_ccm.R
@@ -4,7 +4,8 @@ data("sardine_anchovy_sst")
anchovy_xmap_sst <- ccm(sardine_anchovy_sst, E = 3,
lib_column = "anchovy", target_column = "np_sst",
lib_sizes = seq(10, 80, by = 10), num_samples = 100,
RNGseed = 42)
RNGseed = 42,
silent = TRUE)

a_xmap_t_means <- ccm_means(anchovy_xmap_sst)
plot(a_xmap_t_means$lib_size, a_xmap_t_means$rho, type = "l",
5 changes: 3 additions & 2 deletions inst/tests/test_lnlp.R
@@ -9,8 +9,9 @@ ts <- two_species_model$x[1:200]
simplex(ts, lib = c(1, 100), pred = c(101, 200))

# univariate s-map using E = 2
s_map(ts, E = 2)
s_map(ts, E = 2, silent = TRUE)

# univariate s-map using E = 2, theta = 1,
# and full output with smap_coefficients
y <- s_map(ts, E = 2, theta = 1, save_smap_coefficients = TRUE)
y <- s_map(ts, E = 2, theta = 1, save_smap_coefficients = TRUE,
silent = TRUE)
6 changes: 4 additions & 2 deletions inst/tests/test_multiview.R
@@ -5,11 +5,13 @@ data("block_3sp")
block <- block_3sp[, c(2, 5, 8)]

# do multiview using top 1, top 3, top k = sqrt(m) embeddings
multiview(block, k = c(1, 3, "sqrt"))
multiview(block, k = c(1, 3, "sqrt"),
silent = TRUE)

# same, but save output for plotting
output <- multiview(block, k = c(1, 3, "sqrt"),
stats_only = FALSE)
stats_only = FALSE,
silent = TRUE)

plot(output$model_output[[3]]$obs,
output$model_output[[3]]$pred)
63 changes: 63 additions & 0 deletions inst/tests/test_simplex_calculation.R
@@ -0,0 +1,63 @@
library(rEDM)

ts <- c(-0.056531409251883, 0.059223778257432, 5.24124928046977, -4.85399581474521,
-0.46134818068973, 0.273317575696793, 0.801806230470337, -0.888891901824982,
-0.202777622745051, 0.497565422757662, 5.10219324323769, -5.36826459373178,
-0.17467165498718, 1.06545333399298, 1.97419279178678, -2.91448405223082,
-0.179969867858605, 0.237962494942611, 1.47828327622468, -1.54267507064286,
-0.180342027136338, 0.238919610831881, 1.06140368490958, -1.06522901782019,
-0.214923527940395, 0.452847221462308, 2.13053391555372, -2.55145224744286,
-0.0307653352327702, 1.1448014288826,-0.0675575239486375, -1.04711881585576,
-0.00910890042051652, 0.726257323277433, 0.732271192186161, -1.35460378982395,
-0.0322955446760023, 0.507606440290776, 3.73396587274012, -4.19686615950143,
-0.0997201857962038, 0.753392632401029, 2.41347231553437, -3.03677401452137,
-0.141112562089696, 0.446002103079665, 0.223768504955365, -0.615452831633047,
-0.0216659723974975, 0.292246351104258, 0.20006105300258, -0.469596514211075,
0.0422676544887819, 0.474264989176278, -0.0416811459395667, -0.53555712696719,
0.118860281628173, 0.176335117268894,-0.10364820567334, -0.153572235117542,
0.180339482186409, 0.0566876206447625, -0.140537892644139, 0.0252441742388871,
0.340689505466622, 0.852833653689839, -1.07051231019616, -0.0937704380137284,
0.460677118593916, 0.37444382348273, -0.83783628206217, -0.0154896108244113,
1.34259279914848, -0.495978821807168, -0.472464634960208, -0.415481769949074,
1.36767605087962, -0.891896943918948,-0.279228283931612, -0.148703043863421,
2.04524590138255,-1.98431486665908, 0.0602356391036573, -0.0902730939678147,
0.243344379963862, -0.074421904114315, -0.309150440565139, 0.43675531763949,
0.178787692802827, 0.0799271040758849, -0.657946157906476, 1.14668210755046,
-0.791665479471326, 0.482533897248175, -0.798737571552661, 0.439024256063545,
0.177114631209318, 2.19942374686687, -2.9488856529422)

# construct lagged block
lag_block <- cbind(c(ts[2:length(ts)], NA), ts, c(NA, ts[1:(length(ts)-1)]))
t <- c(2:63, 65:99)

# lib and pred portions
lib_block <- cbind(t+1, lag_block[t,])
pred_block <- cbind(65, lag_block[64, , drop = FALSE])

block <- rbind(lib_block, pred_block)

# make EDM forecast
out <- block_lnlp(block,
lib = c(1, NROW(lib_block)),
pred = c(NROW(lib_block)+1, NROW(lib_block)+1),
first_column_time = TRUE,
tp = 0,
columns = c(2, 3), target_column = 1,
stats_only = FALSE, silent = TRUE)
model_est <- out$model_output[[1]]$pred

# manually calculate distances and neighbors
dist_mat <- as.matrix(dist(block[, 3:4]))
dist_vec <- dist_mat[NROW(dist_mat),]
dist_vec[length(dist_vec)] <- NA
nn <- order(dist_vec)[1:3] # 3 closest neighbors
weights = exp(- dist_vec[nn] / dist_vec[nn[1]])
est = sum(weights * block[nn, 2]) / sum(weights) # weighted average

testthat::test_that("package prediction is same as manual calculation", {
testthat::expect_equal(model_est, est)
})
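The manual check above applies the standard simplex weighting, `w_i = exp(-d_i / d_1)`, over the three nearest neighbors. As a standalone sketch of that arithmetic (a hypothetical helper for illustration, not the package's API):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted average used by simplex projection: each neighbor's weight
// decays exponentially with its distance relative to the nearest one.
// nn_dist must be sorted ascending; nn_target holds the neighbors'
// values of the target column.
double simplex_estimate(const std::vector<double>& nn_dist,
                        const std::vector<double>& nn_target)
{
    double total_weight = 0.0, estimate = 0.0;
    for (std::size_t i = 0; i < nn_dist.size(); ++i)
    {
        double w = std::exp(-nn_dist[i] / nn_dist[0]);
        total_weight += w;
        estimate += w * nn_target[i];
    }
    return estimate / total_weight;
}
```

When all neighbor distances are equal, the weights are equal and the estimate reduces to a plain mean of the neighbors' target values.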




3 changes: 2 additions & 1 deletion inst/tests/test_tde_gp.R
@@ -6,7 +6,8 @@ ts <- two_species_model$x[1:200]

# univariate forecasting with Gaussian processes and E = 1:10, using maximum
# likelihood to estimate params over library points
out <- tde_gp(ts)
out <- tde_gp(ts,
silent = TRUE)

# univariate forecasting with Gaussian processes and E = 5, and first half to
# predict second half
15 changes: 11 additions & 4 deletions src/forecast_machine.cpp
@@ -185,19 +185,27 @@ std::vector<size_t> ForecastMachine::find_nearest_neighbors(const vec& dist)
else
{
size_t i;
nearest_neighbors.push_back(which_lib[0]);
for(auto curr_lib: which_lib)
{
// distance to current neighbor under examination
curr_distance = dist[curr_lib];
if(curr_distance <= dist[nearest_neighbors.back()])

// We want to include the current neighbor:
// if haven't populated neighbors vector, or
// if current neighbor is nearer than farthest away neighbor
if(nearest_neighbors.size() < nn ||
curr_distance <= dist[nearest_neighbors.back()])
{
// find the correct place to insert the current neighbor
i = nearest_neighbors.size();
while((i > 0) && (curr_distance < dist[nearest_neighbors[i-1]]))
{
i--;
}
nearest_neighbors.insert(nearest_neighbors.begin()+i, curr_lib);

// if we've added too many neighbors and there isn't a tie, then
// pop off the farthest neighbor
if((nearest_neighbors.size() > nn) &&
(dist[nearest_neighbors[nn-1]] < dist[nearest_neighbors.back()]))
{
@@ -454,7 +462,6 @@ void ForecastMachine::simplex_prediction(const size_t start, const size_t end)
nearest_neighbors = find_nearest_neighbors(distances[curr_pred]);
}
effective_nn = nearest_neighbors.size();

if(effective_nn == 0)
{
predicted[curr_pred] = qnan;
@@ -483,7 +490,7 @@ void ForecastMachine::simplex_prediction(const size_t start, const size_t end)
min_weight);
}
}

// identify ties and adjust weights
if(effective_nn > nn) // ties exist
{
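The corrected selection logic can be sketched as a standalone function (a simplified, hypothetical free function; the real method lives on `ForecastMachine` and handles further cases such as an empty library):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the fixed neighbor selection: the list is kept sorted by
// distance, every candidate is considered while the list is not yet
// full, and the farthest entry is popped only when doing so would not
// discard a tie at the cutoff.
std::vector<std::size_t> find_nearest_neighbors_sketch(
    const std::vector<double>& dist,
    const std::vector<std::size_t>& which_lib,
    std::size_t nn)
{
    std::vector<std::size_t> neighbors;
    for (std::size_t curr_lib : which_lib)
    {
        double curr_distance = dist[curr_lib];

        // include the candidate if the list is not yet full, or if it
        // is at least as near as the current farthest neighbor
        if (neighbors.size() < nn ||
            curr_distance <= dist[neighbors.back()])
        {
            // find the correct place to insert the candidate
            std::size_t i = neighbors.size();
            while (i > 0 && curr_distance < dist[neighbors[i - 1]])
                --i;
            neighbors.insert(neighbors.begin() + i, curr_lib);

            // too many neighbors and no tie at the cutoff: drop the
            // farthest one
            if (neighbors.size() > nn &&
                dist[neighbors[nn - 1]] < dist[neighbors.back()])
                neighbors.pop_back();
        }
    }
    return neighbors;
}
```

The old code seeded the list with `which_lib[0]` and accepted only candidates at least as near as the current farthest entry, so when the earliest library points happened to be the nearest, later points were rejected before the list was full and fewer than `nn` neighbors could be returned; the `size() < nn` clause removes that dependence on library ordering.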
8 changes: 5 additions & 3 deletions vignettes/rEDM-tutorial.Rmd
@@ -72,7 +72,9 @@ There are many applications for using this approach to recover system dynamics f

## Nearest Neighbor Forecasting using Simplex Projection

As mentioned previously, the reconstruction will map one-to-one to the original attractor manifold if enough lags are used (i.e. if the reconstruction has a sufficiently large embedding dimension). If the embedding dimension is too small, then reconstructed states will overlap and appear to be the same even though they actually correspond to different states. These "singularities" will result in poor forecast performance, because the system behavior cannot be uniquely determined in the reconstruction. Thus, we can use prediction skill as an indicator for identifying the optimal embedding dimension. In the following example, we demonstrate how this can be accomplished using a nearest neighbor forecastaing method, Simplex Projection [@Sugihara_1990], implemented in **rEDM** as the function `simplex`.
As mentioned previously, the reconstruction will map one-to-one to the original attractor manifold if enough lags are used (i.e. if the reconstruction has a sufficiently large embedding dimension). If the embedding dimension is too small, then reconstructed states may overlap and appear to be the same even though they actually correspond to different *true* states of the system. These "singularities" will result in poor forecast performance, because the system behavior cannot be uniquely determined everywhere. Thus, we can use prediction skill as an indicator for the optimal embedding dimension.

In the following example, we demonstrate how this can be accomplished using a nearest neighbor forecasting method, Simplex Projection [@Sugihara_1990], to produce forecasts, and using the correlation between observed and predicted values as the measure of prediction skill.
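The skill measure named here is the Pearson correlation (rho) between observed and predicted values; as a sketch (a hypothetical helper, not the vignette's code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between observations and forecasts: the
// forecast-skill statistic (rho) compared across embedding dimensions.
// Assumes obs and pred have the same, nonzero length and nonzero variance.
double forecast_skill(const std::vector<double>& obs,
                      const std::vector<double>& pred)
{
    std::size_t n = obs.size();
    double mean_o = 0.0, mean_p = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mean_o += obs[i]; mean_p += pred[i]; }
    mean_o /= n;
    mean_p /= n;

    double cov = 0.0, var_o = 0.0, var_p = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        cov   += (obs[i] - mean_o) * (pred[i] - mean_p);
        var_o += (obs[i] - mean_o) * (obs[i] - mean_o);
        var_p += (pred[i] - mean_p) * (pred[i] - mean_p);
    }
    return cov / std::sqrt(var_o * var_p);
}
```

The optimal embedding dimension is then the `E` whose forecasts maximize this statistic over the prediction set.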

### Example

@@ -101,11 +103,11 @@ lib <- c(1, 100)
pred <- c(201, 500)
```

We begin by initializing the `lib` and `pred` variables. These determine which portions of the data will be used to create the reconstruction, and which portions of the data the reconstruction will be used to make forecasts on. In other words, defining the "training" and "test" subsets of the data. Here, the first 100 points (rows 1 to 100) in the time series constitute the "library", and 300 points (rows 201 to 500) are the "prediction set" for which the model will produce forecasts.
We begin by initializing the `lib` and `pred` variables. These determine which portions of the data will be used to create the reconstruction, and which portions will be used to make forecasts on. In other words, defining the "training" and "test" subsets of the data. Here, the first 100 points (rows 1 to 100) in the time series constitute the "library", and 300 points (rows 201 to 500) are the "prediction set" over which the model will produce forecasts.

The remaining parameters will be left at their default values (see section \ref{sec_general_parameters} for details). For the `simplex` function, this means that the embedding dimension, $E$, will range from $1$ to $10$.

*Note that if the code detects any overlap in the lib and pred, it will enable leave-one-out cross-validation and return a warning message.*
*Note that if the code detects any overlap in the lib and pred, it will enable leave-one-out cross-validation and produce a warning message.*

```{r simplex on tentmap}
simplex_output <- simplex(ts, lib, pred)
