bugfix on nearest neighbor selection
* also updates to vignette
* also updates to tests
ha0ye committed Dec 4, 2017
1 parent a2b18c4 commit 5a9af93
Showing 10 changed files with 102 additions and 20 deletions.
7 changes: 4 additions & 3 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: rEDM
Type: Package
Title: Applications of Empirical Dynamic Modeling from Time Series
Version: 0.6.8
Date: 2017-11-13
Version: 0.6.9
Date: 2017-12-04
Maintainer: Hao Ye <[email protected]>
Author: Hao Ye [aut, cre],
Adam Clark [aut],
@@ -14,6 +14,7 @@ Author: Hao Ye [aut, cre],
Jane Cowles [ctb],
James Stagge [ctb],
Yair Daon [ctb],
Andrew Edwards [ctb],
George Sugihara [ctb, ccp]
Description: A new implementation of EDM algorithms based on research software previously developed for internal use in the Sugihara Lab (UCSD/SIO). Contains C++ compiled objects that use time delay embedding to perform state-space reconstruction and nonlinear forecasting and an R interface to those objects using 'Rcpp'. It supports both the simplex projection method from Sugihara & May (1990) <DOI:10.1038/344734a0> and the S-map algorithm in Sugihara (1994) <DOI:10.1098/rsta.1994.0106>. In addition, this package implements convergent cross mapping as described in Sugihara et al. (2012) <DOI:10.1126/science.1227079> and multiview embedding as described in Ye & Sugihara (2016) <DOI:10.1126/science.aag0863>.
License: file LICENSE
@@ -23,6 +24,6 @@ Imports:
methods
LinkingTo: Rcpp, RcppEigen
RcppModules: lnlp_module, block_lnlp_module, ccm_module
Suggests: knitr, rmarkdown, R.rsp, ggplot2
Suggests: knitr, rmarkdown, R.rsp, ggplot2, testthat
VignetteBuilder: knitr, R.rsp
RoxygenNote: 6.0.1
6 changes: 4 additions & 2 deletions inst/tests/test_block_gp.R
@@ -10,11 +10,13 @@ output <- block_gp(block, columns = c("x", "y"),
v_e = seq(from = -300, to = 300, by = 50),
eta = 7,
fit_params = FALSE,
first_column_time = TRUE, stats_only = FALSE)
first_column_time = TRUE, stats_only = FALSE,
silent = TRUE)
plot(output$model_output[[4]]$obs,
output$model_output[[4]]$pred)

# multivariate simplex projection using x and y to predict x
output <- block_gp(block, columns = c("x", "y"),
first_column_time = TRUE, stats_only = FALSE,
save_covariance_matrix = TRUE)
save_covariance_matrix = TRUE,
silent = TRUE)
6 changes: 4 additions & 2 deletions inst/tests/test_block_lnlp.R
@@ -6,7 +6,8 @@ block <- two_species_model[1:200, ]

# multivariate simplex projection using x and y to predict x
output <- block_lnlp(block, columns = c("x", "y"),
first_column_time = TRUE, stats_only = FALSE)
first_column_time = TRUE, stats_only = FALSE,
silent = TRUE)

output <- output$model_output[[1]]
output$pred_err <- sqrt(output$pred_var)
@@ -20,4 +21,5 @@ for (i in t)
}

# cross mapping using x to predict y
block_lnlp(block, target_column = 2, columns = c("x"), first_column_time = TRUE)
block_lnlp(block, target_column = 2, columns = c("x"), first_column_time = TRUE,
silent = TRUE)
3 changes: 2 additions & 1 deletion inst/tests/test_ccm.R
@@ -4,7 +4,8 @@ data("sardine_anchovy_sst")
anchovy_xmap_sst <- ccm(sardine_anchovy_sst, E = 3,
lib_column = "anchovy", target_column = "np_sst",
lib_sizes = seq(10, 80, by = 10), num_samples = 100,
RNGseed = 42)
RNGseed = 42,
silent = TRUE)

a_xmap_t_means <- ccm_means(anchovy_xmap_sst)
plot(a_xmap_t_means$lib_size, a_xmap_t_means$rho, type = "l",
5 changes: 3 additions & 2 deletions inst/tests/test_lnlp.R
@@ -9,8 +9,9 @@ ts <- two_species_model$x[1:200]
simplex(ts, lib = c(1, 100), pred = c(101, 200))

# univariate s-map using E = 2
s_map(ts, E = 2)
s_map(ts, E = 2, silent = TRUE)

# univariate s-map using E = 2, theta = 1,
# and full output with smap_coefficients
y <- s_map(ts, E = 2, theta = 1, save_smap_coefficients = TRUE)
y <- s_map(ts, E = 2, theta = 1, save_smap_coefficients = TRUE,
silent = TRUE)
6 changes: 4 additions & 2 deletions inst/tests/test_multiview.R
@@ -5,11 +5,13 @@ data("block_3sp")
block <- block_3sp[, c(2, 5, 8)]

# do multiview using top 1, top 3, top k = sqrt(m) embeddings
multiview(block, k = c(1, 3, "sqrt"))
multiview(block, k = c(1, 3, "sqrt"),
silent = TRUE)

# same, but save output for plotting
output <- multiview(block, k = c(1, 3, "sqrt"),
stats_only = FALSE)
stats_only = FALSE,
silent = TRUE)

plot(output$model_output[[3]]$obs,
output$model_output[[3]]$pred)
63 changes: 63 additions & 0 deletions inst/tests/test_simplex_calculation.R
@@ -0,0 +1,63 @@
library(rEDM)

ts <- c(-0.056531409251883, 0.059223778257432, 5.24124928046977, -4.85399581474521,
-0.46134818068973, 0.273317575696793, 0.801806230470337, -0.888891901824982,
-0.202777622745051, 0.497565422757662, 5.10219324323769, -5.36826459373178,
-0.17467165498718, 1.06545333399298, 1.97419279178678, -2.91448405223082,
-0.179969867858605, 0.237962494942611, 1.47828327622468, -1.54267507064286,
-0.180342027136338, 0.238919610831881, 1.06140368490958, -1.06522901782019,
-0.214923527940395, 0.452847221462308, 2.13053391555372, -2.55145224744286,
-0.0307653352327702, 1.1448014288826,-0.0675575239486375, -1.04711881585576,
-0.00910890042051652, 0.726257323277433, 0.732271192186161, -1.35460378982395,
-0.0322955446760023, 0.507606440290776, 3.73396587274012, -4.19686615950143,
-0.0997201857962038, 0.753392632401029, 2.41347231553437, -3.03677401452137,
-0.141112562089696, 0.446002103079665, 0.223768504955365, -0.615452831633047,
-0.0216659723974975, 0.292246351104258, 0.20006105300258, -0.469596514211075,
0.0422676544887819, 0.474264989176278, -0.0416811459395667, -0.53555712696719,
0.118860281628173, 0.176335117268894,-0.10364820567334, -0.153572235117542,
0.180339482186409, 0.0566876206447625, -0.140537892644139, 0.0252441742388871,
0.340689505466622, 0.852833653689839, -1.07051231019616, -0.0937704380137284,
0.460677118593916, 0.37444382348273, -0.83783628206217, -0.0154896108244113,
1.34259279914848, -0.495978821807168, -0.472464634960208, -0.415481769949074,
1.36767605087962, -0.891896943918948,-0.279228283931612, -0.148703043863421,
2.04524590138255,-1.98431486665908, 0.0602356391036573, -0.0902730939678147,
0.243344379963862, -0.074421904114315, -0.309150440565139, 0.43675531763949,
0.178787692802827, 0.0799271040758849, -0.657946157906476, 1.14668210755046,
-0.791665479471326, 0.482533897248175, -0.798737571552661, 0.439024256063545,
0.177114631209318, 2.19942374686687, -2.9488856529422)

# construct lagged block
lag_block <- cbind(c(ts[2:length(ts)], NA), ts, c(NA, ts[1:(length(ts)-1)]))
t <- c(2:63, 65:99)

# lib and pred portions
lib_block <- cbind(t+1, lag_block[t,])
pred_block <- cbind(65, lag_block[64, , drop = FALSE])

block <- rbind(lib_block, pred_block)

# make EDM forecast
out <- block_lnlp(block,
lib = c(1, NROW(lib_block)),
pred = c(NROW(lib_block)+1, NROW(lib_block)+1),
first_column_time = TRUE,
tp = 0,
columns = c(2, 3), target_column = 1,
stats_only = FALSE, silent = TRUE)
model_est <- out$model_output[[1]]$pred

# manually calculate distances and neighbors
dist_mat <- as.matrix(dist(block[, 3:4]))
dist_vec <- dist_mat[NROW(dist_mat),]
dist_vec[length(dist_vec)] <- NA
nn <- order(dist_vec)[1:3] # 3 closest neighbors
weights = exp(- dist_vec[nn] / dist_vec[nn[1]])
est = sum(weights * block[nn, 2]) / sum(weights) # weighted average

testthat::test_that("package prediction is same as manual calculation", {
testthat::expect_equal(model_est, est)
})
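The manual check above applies the standard simplex weighting, `w_i = exp(-d_i / d_1)`, over the three nearest neighbors. As a standalone sketch of that arithmetic (a hypothetical helper for illustration, not the package's API):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted average used by simplex projection: each neighbor's weight
// decays exponentially with its distance relative to the nearest one.
// nn_dist must be sorted ascending; nn_target holds the neighbors'
// values of the target column.
double simplex_estimate(const std::vector<double>& nn_dist,
                        const std::vector<double>& nn_target)
{
    double total_weight = 0.0, estimate = 0.0;
    for (std::size_t i = 0; i < nn_dist.size(); ++i)
    {
        double w = std::exp(-nn_dist[i] / nn_dist[0]);
        total_weight += w;
        estimate += w * nn_target[i];
    }
    return estimate / total_weight;
}
```

When all neighbor distances are equal, the weights are equal and the estimate reduces to a plain mean of the neighbors' target values.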




3 changes: 2 additions & 1 deletion inst/tests/test_tde_gp.R
@@ -6,7 +6,8 @@ ts <- two_species_model$x[1:200]

# univariate forecasting with Gaussian processes and E = 1:10, using maximum
# likelihood to estimate params over library points
out <- tde_gp(ts)
out <- tde_gp(ts,
silent = TRUE)

# univariate forecasting with Gaussian processes and E = 5, and first half to
# predict second half
15 changes: 11 additions & 4 deletions src/forecast_machine.cpp
@@ -185,19 +185,27 @@ std::vector<size_t> ForecastMachine::find_nearest_neighbors(const vec& dist)
else
{
size_t i;
nearest_neighbors.push_back(which_lib[0]);
for(auto curr_lib: which_lib)
{
// distance to current neighbor under examination
curr_distance = dist[curr_lib];
if(curr_distance <= dist[nearest_neighbors.back()])

// We want to include the current neighbor:
// if haven't populated neighbors vector, or
// if current neighbor is nearer than farthest away neighbor
if(nearest_neighbors.size() < nn ||
curr_distance <= dist[nearest_neighbors.back()])
{
// find the correct place to insert the current neighbor
i = nearest_neighbors.size();
while((i > 0) && (curr_distance < dist[nearest_neighbors[i-1]]))
{
i--;
}
nearest_neighbors.insert(nearest_neighbors.begin()+i, curr_lib);

// if we've added too many neighbors and there isn't a tie, then
// pop off the farthest neighbor
if((nearest_neighbors.size() > nn) &&
(dist[nearest_neighbors[nn-1]] < dist[nearest_neighbors.back()]))
{
@@ -454,7 +462,6 @@ void ForecastMachine::simplex_prediction(const size_t start, const size_t end)
nearest_neighbors = find_nearest_neighbors(distances[curr_pred]);
}
effective_nn = nearest_neighbors.size();

if(effective_nn == 0)
{
predicted[curr_pred] = qnan;
@@ -483,7 +490,7 @@ void ForecastMachine::simplex_prediction(const size_t start, const size_t end)
min_weight);
}
}

// identify ties and adjust weights
if(effective_nn > nn) // ties exist
{
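The corrected selection logic can be sketched as a standalone function (a simplified, hypothetical free function; the real method lives on `ForecastMachine` and handles further cases such as an empty library):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the fixed neighbor selection: the list is kept sorted by
// distance, every candidate is considered while the list is not yet
// full, and the farthest entry is popped only when doing so would not
// discard a tie at the cutoff.
std::vector<std::size_t> find_nearest_neighbors_sketch(
    const std::vector<double>& dist,
    const std::vector<std::size_t>& which_lib,
    std::size_t nn)
{
    std::vector<std::size_t> neighbors;
    for (std::size_t curr_lib : which_lib)
    {
        double curr_distance = dist[curr_lib];

        // include the candidate if the list is not yet full, or if it
        // is at least as near as the current farthest neighbor
        if (neighbors.size() < nn ||
            curr_distance <= dist[neighbors.back()])
        {
            // find the correct place to insert the candidate
            std::size_t i = neighbors.size();
            while (i > 0 && curr_distance < dist[neighbors[i - 1]])
                --i;
            neighbors.insert(neighbors.begin() + i, curr_lib);

            // too many neighbors and no tie at the cutoff: drop the
            // farthest one
            if (neighbors.size() > nn &&
                dist[neighbors[nn - 1]] < dist[neighbors.back()])
                neighbors.pop_back();
        }
    }
    return neighbors;
}
```

The old code seeded the list with `which_lib[0]` and accepted only candidates at least as near as the current farthest entry, so when the earliest library points happened to be the nearest, later points were rejected before the list was full and fewer than `nn` neighbors could be returned; the `size() < nn` clause removes that dependence on library ordering.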
8 changes: 5 additions & 3 deletions vignettes/rEDM-tutorial.Rmd
@@ -72,7 +72,9 @@ There are many applications for using this approach to recover system dynamics f

## Nearest Neighbor Forecasting using Simplex Projection

As mentioned previously, the reconstruction will map one-to-one to the original attractor manifold if enough lags are used (i.e. if the reconstruction has a sufficiently large embedding dimension). If the embedding dimension is too small, then reconstructed states will overlap and appear to be the same even though they actually correspond to different states. These "singularities" will result in poor forecast performance, because the system behavior cannot be uniquely determined in the reconstruction. Thus, we can use prediction skill as an indicator for identifying the optimal embedding dimension. In the following example, we demonstrate how this can be accomplished using a nearest neighbor forecastaing method, Simplex Projection [@Sugihara_1990], implemented in **rEDM** as the function `simplex`.
As mentioned previously, the reconstruction will map one-to-one to the original attractor manifold if enough lags are used (i.e. if the reconstruction has a sufficiently large embedding dimension). If the embedding dimension is too small, then reconstructed states may overlap and appear to be the same even though they actually correspond to different *true* states of the system. These "singularities" will result in poor forecast performance, because the system behavior cannot be uniquely determined everywhere. Thus, we can use prediction skill as an indicator for the optimal embedding dimension.

In the following example, we demonstrate how this can be accomplished using a nearest neighbor forecasting method, Simplex Projection [@Sugihara_1990], to produce forecasts, and using the correlation between observed and predicted values as the measure of prediction skill.
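The skill measure named here is the Pearson correlation (rho) between observed and predicted values; as a sketch (a hypothetical helper, not the vignette's code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between observations and forecasts: the
// forecast-skill statistic (rho) compared across embedding dimensions.
// Assumes obs and pred have the same, nonzero length and nonzero variance.
double forecast_skill(const std::vector<double>& obs,
                      const std::vector<double>& pred)
{
    std::size_t n = obs.size();
    double mean_o = 0.0, mean_p = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mean_o += obs[i]; mean_p += pred[i]; }
    mean_o /= n;
    mean_p /= n;

    double cov = 0.0, var_o = 0.0, var_p = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        cov   += (obs[i] - mean_o) * (pred[i] - mean_p);
        var_o += (obs[i] - mean_o) * (obs[i] - mean_o);
        var_p += (pred[i] - mean_p) * (pred[i] - mean_p);
    }
    return cov / std::sqrt(var_o * var_p);
}
```

The optimal embedding dimension is then the `E` whose forecasts maximize this statistic over the prediction set.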

### Example

@@ -101,11 +103,11 @@ lib <- c(1, 100)
pred <- c(201, 500)
```

We begin by initializing the `lib` and `pred` variables. These determine which portions of the data will be used to create the reconstruction, and which portions of the data the reconstruction will be used to make forecasts on. In other words, defining the "training" and "test" subsets of the data. Here, the first 100 points (rows 1 to 100) in the time series constitute the "library", and 300 points (rows 201 to 500) are the "prediction set" for which the model will produce forecasts.
We begin by initializing the `lib` and `pred` variables. These determine which portions of the data will be used to create the reconstruction, and which portions will be used to make forecasts on. In other words, defining the "training" and "test" subsets of the data. Here, the first 100 points (rows 1 to 100) in the time series constitute the "library", and 300 points (rows 201 to 500) are the "prediction set" over which the model will produce forecasts.

The remaining parameters will be left at their default values (see section \ref{sec_general_parameters} for details). For the `simplex` function, this means that the embedding dimension, $E$, will range from $1$ to $10$.

*Note that if the code detects any overlap in the lib and pred, it will enable leave-one-out cross-validation and return a warning message.*
*Note that if the code detects any overlap in the lib and pred, it will enable leave-one-out cross-validation and produce a warning message.*

```{r simplex on tentmap}
simplex_output <- simplex(ts, lib, pred)
