Introduction:

Hsieh et al. 2008 propose a method to detect non-linearity in ecological time series using the simplex and s-map projections developed by Sugihara and co-workers. Generally, non-linear time series analysis requires long-time series to detect signals. The approach by Hsieh et al. combines individual time series of ecologically similar species into one long time series to which simplex and s-maps can be applied (termed Dewdrop regression by the authors). To illustrate their approach, a five species competition model was used to generate time series and the ability of the method to get back the original parameter values of the simulation tested under various levels of noise and lengths of the time series.

The goals for this reproduction are: * set up a simulation model of multi-species competition * understanding simplex and s-maps as forecasting tools using the recently released rEDM package. * understand the approach taken to combine several short time series into one long time series to detect non-linearity and make useful forecasts

Step 1: Simulate 5 species dynamics (as described in detail in Appendix B):

First we reproduced the dynamics of the 5 species community as follows:

The following R code implements the simulation model:

# This code implements a multispecies competition model based on the logistic map (https://en.wikipedia.org/wiki/Logistic_map)
# the equilibrium density depends on the growth rate (steady state at values between 1 and 2, extinction below 1, chaos > 3.7, fluctuations > 3)

# set seed
set.seed(190981)

## Specify the number of species
S <- 5

#growth rate in the eight-mode oscillation regime below the threshold to chaos
r <- 3.57

# specify alphas
a <- 0.05

# simulate 1000 time steps
t=seq(1,1000, by=1)

# specify matrix to hold results
N <- matrix(nrow=S, ncol=length(t))

# set initial conditions randomly
N[,1] <- runif(S)

# simulate dynamics using the discrete logistic growth model; add process noise at the end
for (i in 1:(length(t)-1)){
  
  for (j in 1:5){

    N[j,i+1] = (r * N[j,i] * (1- N[j,i]) - a*sum(N[-j,i]) ) * runif(1,0.85,1.15)
    
  }
}

# add some observation noise to time series
N <- N * runif(length(N), 0.95, 1.05)

output <- as.data.frame(t(N[1:S,]))
names(output) <- paste0("species ", 1:S)

# check for correlation in abundances
#plot(output$`species 1`, output$`species 5`)

#write.csv(output, "/Users/Frank/Documents/Git projects/RREEBES/Hsieh_et_al_2008/data/5sp_comp_sim_data.csv", row.names=F)

five_sp_comp <- read.csv("/Users/Frank/Documents/Git projects/RREEBES/Hsieh_et_al_2008/data/5sp_comp_sim_data.csv")
#names(five_sp_comp) <- c("time", paste0("species.", 1:5))
five_sp_comp$time <- seq(1,1000)

five_sp_comp <- five_sp_comp[,c(6,1:5)]

Plot shows time series for the 5 species competition model.

# plot results
matplot(t,t(N[1:S,]), "l",ylim=c(0,1), xlim=c(0,100),col=2:5)

Are species dynamically similar? Look at state-space plot of the different species:

One of the assumptions of the method is that composite time series originate from species with similar dynamics. We hence checked that dynamics of two species look similar with a state-space plot using the embeddings of the time series, which is just a sanity check, as the model was setup to simulate species with the same dynamics (same r and alphas) but obscured dynamics due to process and observation noise.

Embed <- function(X, dimension=4) {
  E_X <- matrix(NA, length(X)-dimension, dimension)
  for(i in 1:dimension)
    E_X[,i] <- X[(1+i-1):(length(X)-dimension+i-1)]
  E_X
}


library(plot3D)

embed_sp5 <- Embed(five_sp_comp[,c(2)], dimension=3)
colnames(embed_sp5) <- paste0("embed",1:3)
points3D(x = embed_sp5[,1],
       y = embed_sp5[,2],
       z = embed_sp5[,3],  col="blue")

embed_sp5 <- Embed(five_sp_comp[,c(3)], dimension=3)
points3D(x = embed_sp5[,1],
       y = embed_sp5[,2],
       z = embed_sp5[,3],  col="red", add=T)

The two species have indeed a similar geometric shape in the state-space plot

Next we try to reproduce figure 1 from the publication. Panel A uses a 4th order AR model fitted to species 5 (to get coefficients) to predict the dynamics of species 5. The following R chunk fits the AR(4) model:

# fig 1 A
#plot(five_sp_comp[,3], five_sp_comp[,5], xlim=c(0,1.1), ylim=c(0,1.1))

# fit AR(4) model
ar_mod <-  ar(as.ts(five_sp_comp[,c(5)]), FALSE, 4)
#str(ar_mod)

ar_mod$ar
## [1] -0.600709567 -0.020754073 -0.092533114  0.003297379

Now make predictions for species 3 based on species 5. We use this information later when compiling figure 1.

sp3_ar_prediction <- vector("numeric", 1000)
# predict species 3 based on species 5
for (t in 5:1000){

sp3_ar_prediction[t] = ar_mod$x.mean + ar_mod$ar[1] * five_sp_comp[(t-1),3] + ar_mod$ar[2] * five_sp_comp[(t-2),3]  + ar_mod$ar[3] * five_sp_comp[(t-3),3] + ar_mod$ar[4] * five_sp_comp[(t-4),3]

}

#plot(sp3_ar_prediction, five_sp_comp[,3])

Use rEDM package to make predictions

The following part uses the rEDM package to make predictions with the simplex and s-map projection methods. For quick walkthroughs of simplex projection and the related convergent crossmappings, please have a look to the following RREEBES materials:
- http://opetchey.github.io/RREEBES/Sugihara_and_May_1990_Nature/Simplex_projection_walkthrough.html
- http://opetchey.github.io/RREEBES/Sugihara_etal_2012_Science/Sugihara_2012_Science_Fig3A.html

Step 2: Apply simplex projection from rEDM package to simulated time series (predicting species 5)

First, let’s predict dynamics of species 5 with information from its own time series. The goal is to predict last 15 time steps of species 1 based on previous 985 time steps of same species. The simplex method is often used to determine the embedding dimension (because it is less prone to overfitting) and use that parameter in a subsequent s-map projection.

The first let’s have a look at the mbedding dimensions. We select the dimension where forecast skill peaks.

plot(simplex_mod_stats_sp5_1$E, simplex_mod_stats_sp5_1$rho, type="l", xlab = "Embedding dimension (E)", ylab="Forecast skill (rho)")

Now let’s look at the actual forecasting skill by looking at the overlay of predicted and observed abundances:

# compare data with predictions visually
plot(five_sp_comp$time, five_sp_comp$species.5, "l", ylim=c(0,1), xlim = c(900,1000), xlab="time step", ylab="abundance species 5")
par(new=T)
plot(simplex_mod_pred_sp5_2[[3]]$model_output$time, simplex_mod_pred_sp5_2[[3]]$model_output$pred, "l", lty = 2, col="red", ylim=c(0,1), xlim = c(900,1000), xlab="time step", ylab="abundance species 5")

Looks pretty good, the previous plots shows us that we have a correlation coefficient about 0.8 for observed and predicted.

Step 3: apply s-maps to simulated time series with best embedding dimension as determined by simplex (predicting species 5)

Next we want to use the more sophisticated s-maps, which can also be used to test for non-linearity of time series. As the forecast skill was about the same for dimensions 1 to 4, and in the paper a D of 4 is suggested, we use that as our setting for the call to the smap function.

# now use smap with E dimensions (identified with Simplex)
smap_out <- s_map(five_sp_comp$species.5, lib = c(1, 985), pred = c(986,NROW(five_sp_comp$species.5)), stats_only = T,
                  norm_type = c("L2 norm"), E = 4)
#smap_out

# plot theta against forecasting skill
plot(smap_out$theta, smap_out$rho, type="l", xlab = "Non-linearity", ylab="Forecast skill (rho)")

smap_pred <- s_map(five_sp_comp$species.5, lib = c(1, 985), pred = c(986,NROW(five_sp_comp$species.5)), stats_only = F,
                  norm_type = c("L2 norm"), E = 4, theta = 4)

sp5_observations  <- smap_pred[[1]]$model_output$obs
sp5_predictions  <- smap_pred[[1]]$model_output$pred

#plot(sp5_observations, sp5_predictions)
#abline(0,1)

Step 4: test co-predictability (use library based on species 3 to predict species 5)

The paper proposes that dynamics of dynamically similar species can be predicted by extending the library of training data by pooling data from multiple species. This would be powerful as time series in ecology are often short. Real time series would be needed to scale and normalize to be stitched together, and “seams”" between time series of different species should be excluded. Because the data we deal with is simulated and the logistic map is anyways between 0 and 1, this step is unnecessary for the reproduced example.

To test the co-predictability of dynamics, the block_lnlp function is used. Embeddings need to be constructed in a manual fashion, but both simplex and s-maps can be fitted. Importantly, to achieve a good predictive skill, the original time series of the species to be predicted need to be included in the library of training data.

# predict species 3 based on embeddings in species 5
block_data <- five_sp_comp[,c(1,4,6)]
n <- NROW(block_data$species.5)
block_data$species.5m1 <- c(NA, block_data$species.5[-n])
block_data$species.5m2 <- c(NA, block_data$species.5m1[-n])
block_data$species.5m3 <- c(NA, block_data$species.5m2[-n])
block_data$species.5m4 <- c(NA, block_data$species.5m3[-n])

block <- block_lnlp(block_data, lib = c(1, 1000), pred = c(1,NROW(block_data)),
  norm_type = c("L2 norm"), 
  method = c("s-map"), tp = 1, num_neighbors = "0.0001", 
  columns = c("species.3",  "species.5m1", "species.5m2", "species.5m3", "species.5m4"), target_column = c("species.3"), stats_only = T, first_column_time = T)
## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.
block
##       embedding tp     nn theta num_pred       rho        mae       rmse
## 1 1, 3, 4, 5, 6  1 0.0001     0      970 0.7607972 0.07574286 0.09698489
##   perc         p_val const_pred_num_pred const_pred_rho const_pred_mae
## 1    1 8.360683e-212                 995      -0.658924      0.2199712
##   const_pred_rmse const_pred_perc const_p_val
## 1       0.2626675               1           1
block <- block_lnlp(block_data, lib = c(1, 1000), pred = c(1,NROW(block_data)),
  norm_type = c("L2 norm"), 
  method = c("s-map"), tp = 1, num_neighbors = "0.0001", 
  columns = c("species.3", "species.5m1", "species.5m2", "species.5m3", "species.5m4"), target_column = c("species.3"), stats_only = F, first_column_time = T)
## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.
sp3_observations  <- block[[1]]$model_output$obs
sp3_predictions  <- block[[1]]$model_output$pred

#plot(sp3_observations, sp3_predictions)
#abline(a=0,b=1)

We can now assemble Figure 1 from previous chunks:

par(mfrow=c(1,3))
plot(five_sp_comp[,3], five_sp_comp[,5], xlab="Species 3",ylab="Species 5",main="Cross-species Correlation")

plot(sp3_ar_prediction, five_sp_comp[,3],xlab="Prediction using Species 3",ylab="Species 3 (true value)",main="Cross-species AR Model")
abline(a=1, b=1)

plot(sp3_observations, sp3_predictions,xlab="Prediction using Species 5",ylab="Species 3 (true value)",main="Cross-species S-Map")
abline(a=0,b=1)

par(mfrow=c(1,1))

This looks pretty similar to the figure in the publication! Great.

Comparing predictive skill between short, long and composite (time series of dynamically similar species combined) time series

First we need to do some house-keeping to select the appropriate datasets to predict and to train the s-maps.

lib <- c(1,984)
pred <- c(985,999)
shlib <- c(685,700)
shpred <- c(701,715)


corr <- matrix(NA, ncol=8, nrow=5)
corr_sh  <- matrix(NA, ncol=8, nrow=5)
corr_com  <- matrix(NA, ncol=8, nrow=5)

win=700  #where to start in the time series
len=30

compred <- matrix(NA,nrow=5, ncol=2)
for(i in 1:5) {
    (1:5)[-i]->j
    compred[i,1:2]=c(i*1000-win-len/2,i*1000-win)
    comlib=matrix(c(i*1000-win-1,j*1000-win-len/2,i*1000-win+len/2,j*1000-win+len/2),nrow=5)
}

for (i in 1:5){
  simplex_mod_stats_sp <- simplex(five_sp_comp[,i+1], lib = lib, pred = pred,
                norm_type = c("L2 norm"), E = 1:8,
                tau = 1, tp = 1, num_neighbors = "e+1", stats_only = T,
                exclusion_radius = NULL, epsilon = NULL, silent = FALSE)
  corr[i,1:8] <- simplex_mod_stats_sp$rho 
}


for (i in 1:5){
  simplex_mod_stats_sp <- simplex(five_sp_comp[,i+1], lib = shlib, pred = shpred,
                norm_type = c("L2 norm"), E = 1:8,
                tau = 1, tp = 1, num_neighbors = "e+1", stats_only = T,
                exclusion_radius = NULL, epsilon = NULL, silent = FALSE)
  corr_sh[i,1:8] <- simplex_mod_stats_sp$rho 
}

library(reshape2)
t <- melt(five_sp_comp[,2:6])
## No id variables; using all as measure variables
t[1] <- 1:5000

for (i in 1:5){
  simplex_mod_stats_sp <- simplex(t[,2], lib = comlib, pred = compred[i,],
                norm_type = c("L2 norm"), E = 1:8,
                tau = 1, tp = 1, num_neighbors = "e+1", stats_only = T,
                exclusion_radius = NULL, epsilon = NULL, silent = FALSE)
  corr_com[i,1:8] <- simplex_mod_stats_sp$rho 
  
}
## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.

## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.

## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.

## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.

## Warning in model$run(): Found overlap between lib and pred. Enabling cross-
## validation with exclusion radius = 0.
corr_D <- as.data.frame(t(apply(corr,2,quantile,probs=c(.025,.5,.975))))
corr_D$type <- "Long time series N=1000" 
corr_sh_D <-as.data.frame(t(apply(corr_sh,2,quantile,probs=c(.025,.5,.975))))
corr_sh_D$type <- "Short timeseries N=30" 
corr_com_D <- as.data.frame(t(apply(corr_com,2,quantile,probs=c(.025,.5,.975))))
corr_com_D$type <- "Composite timeseries N = 5 * 30" 

corr_D$D <- corr_sh_D$D <- corr_com_D$D <-  1:8

corr_D_full <- rbind(corr_D, rbind(corr_sh_D, corr_com_D))

No we are able to plot the results of Figure 2:

library(ggplot2)

ggplot() + geom_line(data=corr_D_full, aes(x=D, y=`50%`)) +
  geom_line(data=corr_D_full, aes(x=D, y=`2.5%`), colour="blue", linetype="dashed") + 
  geom_line(data=corr_D_full, aes(x=D, y=`97.5%`), colour="blue", linetype="dashed") + facet_grid(.~type) + ylab("Correlation")

Looks pretty similar to figure 2, with full time series showing the best predictive skill across dimensionality 10, composite time series being pretty similar and only short time series suffering from a strong decay in predictive skill at higher dimensionalities.

Attemps to reproduce remainder of the paper (including Fig. 3) were abandoned as the three goals of the reproduction were achieved.