Introduction
Data processing
Results
- Time series forecasting
- Calculation of mean absolute scaled error
Figures

Introduction

This report is the reproduction of the analyses shown in Ward, E.J., Holmes, E.E., Thorson, J.T. & Collen, B. (2014) Complexity is costly: A meta-analysis of parametric and non-parametric methods for short-term population forecasting. Oikos, 123, 652–661.

Eric Ward kindly provided a subset of the raw data, processed data as well as the R scripts used for model fitting; they can be found in the respective folders. His original repository can be found here. Because the data originates from a collaborative project, not all data is freely available yet; however, the data provided by Eric allows to reproduce the analyses on the fish time-series shown in the paper.

Data processing

## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here
## 
## Loading required package: bitops
## Loading required package: nlme
## This is mgcv 1.8-6. For overview type 'help("mgcv-package")'.
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-2)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
## locfit 1.5-9.1    2013-03-22
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 6.1 
## 
## 
## Attaching package: 'forecast'
## 
## The following object is masked from 'package:nlme':
## 
##     getResponse
## 
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Read time series data provided from repository:

ts <- read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/WARD_etal_2014_Oikos/WARD_etal_2014_Oikos/processed%20data/masterDat%20052015.csv"), header=T, stringsAsFactors=F)
metainfo <- read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/WARD_etal_2014_Oikos/WARD_etal_2014_Oikos/processed%20data/Data%20and%20metadata%20052015.csv"), header=T, stringsAsFactors=F)

#look at data structure
str(ts)

## 'data.frame':    53130 obs. of  8 variables:
##  $ X       : int  2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 ...
##  $ ID      : int  62 62 62 62 62 62 62 62 62 62 ...
##  $ Database: chr  "salmon" "salmon" "salmon" "salmon" ...
##  $ Spcode  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
##  $ Species : chr  "Chinook" "Chinook" "Chinook" "Chinook" ...
##  $ Class   : chr  "Actinopterygii" "Actinopterygii" "Actinopterygii" "Actinopterygii" ...
##  $ Value   : num  10 10.1 10.2 10.3 9.1 ...

head(ts)

##      X ID Database Spcode Year Species          Class     Value
## 1 2380 62   salmon      1 1951 Chinook Actinopterygii  9.998798
## 2 2381 62   salmon      1 1952 Chinook Actinopterygii 10.126631
## 3 2382 62   salmon      1 1953 Chinook Actinopterygii 10.239960
## 4 2383 62   salmon      1 1954 Chinook Actinopterygii 10.275051
## 5 2384 62   salmon      1 1955 Chinook Actinopterygii  9.104980
## 6 2385 62   salmon      1 1956 Chinook Actinopterygii  8.294050

# number of fish ts
length(unique(ts$ID))

## [1] 1266

ts$ID_old <- as.factor(ts$ID)
ts$ID_new <- as.numeric(ts$ID_old)
ts$ID <- ts$ID_new

Explore some time series visually to get an idea of the variability. Figure does not appear in the paper. Time series were centered to easily plot them simultaneousl [e.g. Year-Mean(Year) and Value - Mean(Value)].

set.seed(12345678)
ggplot(data=subset(ts, ID %in% sample(ts$ID,size=15)), aes(x=Year-mean(Year), y=Value-mean(Value))) + geom_line() + facet_wrap(ID~Species,ncol=5,nrow=3, scales="free")

Results

Time series forecasting

The forecasting script was provided but required some adjusting to run on the available data. It is not shown here for its length. Running the script will produce forecasts based on the training data (all data points except the five last data points). This script takes considerable time to run as it loops over each time series and fits the 49 different model parameterizations. An object containing the output of the full run can be found here: https://www.dropbox.com/s/p1j16tgzlrwrr7m/Ward_et_al_results.RData?dl=0

MARSS failed on a couple of time series (IDs: 430, 514, 515, 590, 1234) which were hence excluded.

Calculation of mean absolute scaled error

Implementation of the absolute scaled error (ASE) for a single time-series. The absolute error is scaled by the mean absolute error within the training data to account for the different variability per time series. ASE is calculated for the different forecasting methods.

get.ASE <- function(x){
  # x is the data used in ddply for each "ID_new" and "model_type"
  #x <- subset(prediction, ID_new==1 & model_type == "GAM (gam)")
  training_dd <- x[1:(nrow(x)-5),]
  mean.abs.error <-( 1/(nrow(training_dd)-1) ) * sum(abs(training_dd[2:nrow(training_dd),"Value"] - training_dd[1:nrow(training_dd)-1,"Value"]))
  
  res_ASE <- rep(NA,nrow(x)) 
  step_ahead <- rep(NA,nrow(x))
  for (t in 1:5){
    res_ASE[nrow(training_dd)+t] <- abs(x[nrow(training_dd)+t,"Value"]-x[nrow(training_dd)+t,"Value_predicted"]) / mean.abs.error
    step_ahead[nrow(training_dd)+t] <- t
  }
  
  #res_ASE[(nrow(x)-4):nrow(x)] <- sapply(1:5, function(t){abs(x[nrow(training_dd)+t,"Value"]-x[nrow(training_dd)+t,"Value_predicted"]) / mean.abs.error})
    
  cbind.data.frame(x,ASE=res_ASE, step_ahead=step_ahead)
}

#out put with ASE estimate
prediction_ASE <- ddply(.data=prediction, c("ID_new","model_type"), get.ASE)
#save.image("/Volumes/Work MAC backup/Ward_et_al_results.RData")

Figures

Prepare data for visualization by re-ordering factor levels. For some of the models it is not entirely clear which of the different specifications was taken.

Reproducing Figure 1

Visualize overall forecasting performance using the 13 models specified in the caption. The plot in the manuscript shows only 12 models! Currently it is hard to determine the exact grouping used (e.g. salmon, marine fish productivity) because there is no corresponding label shipped with the data. Nevertheless, the plots pretty much look like Figure 1 from the paper!

mean_ASE <- ddply(models_selected, .(Database, model_type, data_dist, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
       
ggplot(data=subset(mean_ASE, step_ahead<5 & mean_log_ASE<60 & (Database == "salmon" | Database == "RAMlegacy.recperssb")), aes(x=model_type,y=mean_log_ASE, group=step_ahead, linetype=as.factor(step_ahead), size=as.factor(step_ahead), shape=data_dist)) + geom_point(size=2) + geom_line() + 
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=16)) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid")) + 
  scale_size_manual(values=c(0.8,0.8,0.8,0.4))  + scale_shape_manual(values=c(2,1)) + facet_wrap(~Database)

# mean_ASE_salmon_all <- ddply(subset(models_selected, Species %in% c("Chinook", "Chum", "Pink", "Coho", "Sockeye")), .(model_type, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
# 
# ggplot(data=subset(mean_ASE_salmon_all, step_ahead<5), aes(x=model_type,y=mean_log_ASE, group=as.factor(step_ahead), linetype=as.factor(step_ahead), size=as.factor(step_ahead))) + geom_point() + geom_line() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=16)) + scale_size_manual(values=c(0.8,0.8,0.8,0.4)) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid"))

Reproducing Figure 2

Plot MASE for the different species of salmon separately. No “Coho” in the current dataset, hence no plot produced. Nevertheless, plots pretty similar to what was shown in the paper.

mean_ASE_salmon <- ddply(subset(models_selected, Species %in% c("Chinook", "Chum", "Pink", "Coho", "Sockeye")), .(Species, model_type, data_dist, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
mean_ASE_salmon$Species <- factor(mean_ASE_salmon$Species ,levels(as.factor(mean_ASE_salmon$Species))[c(3,2,4,5,1)])

ggplot(data=subset(mean_ASE_salmon, step_ahead<5), aes(x=model_type,y=mean_log_ASE, group=as.factor(step_ahead), linetype=as.factor(step_ahead), size=as.factor(step_ahead), shape=data_dist)) + geom_point(size=2) + geom_point() + geom_line() + theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=16)) + facet_wrap(~Species) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid")) + scale_size_manual(values=c(0.8,0.8,0.8,0.4)) + scale_shape_manual(values=c(2,1))

No reproduction of Figure 3

Partial reproduction of Figure 4

Based on meta information data set. This is pretty approximate, but ok as it was not the main target of the reproduction.

ggplot(subset(metainfo, Database %in% c("salmon" , "RAMlegacy.recperssb")), aes(x=training.acf)) + geom_histogram() + facet_wrap(~Database) + geom_vline(xintercept = 0, linetype="dashed", size=2)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Reproduce analyses of Ward et al (2014): Complexity is costly

Aurélie Garnier, Frank Pennekamp, Mikael Pontarp

1 Jun 2015