This report is the reproduction of the analyses shown in Ward, E.J., Holmes, E.E., Thorson, J.T. & Collen, B. (2014) Complexity is costly: A meta-analysis of parametric and non-parametric methods for short-term population forecasting. Oikos, 123, 652–661.
Eric Ward kindly provided a subset of the raw data, processed data as well as the R scripts used for model fitting; they can be found in the respective folders. His original repository can be found here. Because the data originates from a collaborative project, not all data is freely available yet; however, the data provided by Eric allows to reproduce the analyses on the fish time-series shown in the paper.
##
## Attaching package: 'lubridate'
##
## The following object is masked from 'package:plyr':
##
## here
##
## Loading required package: bitops
## Loading required package: nlme
## This is mgcv 1.8-6. For overview type 'help("mgcv-package")'.
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-2)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
## locfit 1.5-9.1 2013-03-22
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: timeDate
## This is forecast 6.1
##
##
## Attaching package: 'forecast'
##
## The following object is masked from 'package:nlme':
##
## getResponse
##
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Read time series data provided from repository:
ts <- read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/WARD_etal_2014_Oikos/WARD_etal_2014_Oikos/processed%20data/masterDat%20052015.csv"), header=T, stringsAsFactors=F)
metainfo <- read.csv(text=getURL("https://raw.githubusercontent.com/opetchey/RREEBES/WARD_etal_2014_Oikos/WARD_etal_2014_Oikos/processed%20data/Data%20and%20metadata%20052015.csv"), header=T, stringsAsFactors=F)
#look at data structure
str(ts)
## 'data.frame': 53130 obs. of 8 variables:
## $ X : int 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 ...
## $ ID : int 62 62 62 62 62 62 62 62 62 62 ...
## $ Database: chr "salmon" "salmon" "salmon" "salmon" ...
## $ Spcode : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
## $ Species : chr "Chinook" "Chinook" "Chinook" "Chinook" ...
## $ Class : chr "Actinopterygii" "Actinopterygii" "Actinopterygii" "Actinopterygii" ...
## $ Value : num 10 10.1 10.2 10.3 9.1 ...
head(ts)
## X ID Database Spcode Year Species Class Value
## 1 2380 62 salmon 1 1951 Chinook Actinopterygii 9.998798
## 2 2381 62 salmon 1 1952 Chinook Actinopterygii 10.126631
## 3 2382 62 salmon 1 1953 Chinook Actinopterygii 10.239960
## 4 2383 62 salmon 1 1954 Chinook Actinopterygii 10.275051
## 5 2384 62 salmon 1 1955 Chinook Actinopterygii 9.104980
## 6 2385 62 salmon 1 1956 Chinook Actinopterygii 8.294050
# number of fish ts
length(unique(ts$ID))
## [1] 1266
ts$ID_old <- as.factor(ts$ID)
ts$ID_new <- as.numeric(ts$ID_old)
ts$ID <- ts$ID_new
Explore some time series visually to get an idea of the variability. Figure does not appear in the paper. Time series were centered to easily plot them simultaneousl [e.g. Year-Mean(Year) and Value - Mean(Value)].
set.seed(12345678)
ggplot(data=subset(ts, ID %in% sample(ts$ID,size=15)), aes(x=Year-mean(Year), y=Value-mean(Value))) + geom_line() + facet_wrap(ID~Species,ncol=5,nrow=3, scales="free")
The forecasting script was provided but required some adjusting to run on the available data. It is not shown here for its length. Running the script will produce forecasts based on the training data (all data points except the five last data points). This script takes considerable time to run as it loops over each time series and fits the 49 different model parameterizations. An object containing the output of the full run can be found here: https://www.dropbox.com/s/p1j16tgzlrwrr7m/Ward_et_al_results.RData?dl=0
MARSS failed on a couple of time series (IDs: 430, 514, 515, 590, 1234) which were hence excluded.
Implementation of the absolute scaled error (ASE) for a single time-series. The absolute error is scaled by the mean absolute error within the training data to account for the different variability per time series. ASE is calculated for the different forecasting methods.
get.ASE <- function(x){
# x is the data used in ddply for each "ID_new" and "model_type"
#x <- subset(prediction, ID_new==1 & model_type == "GAM (gam)")
training_dd <- x[1:(nrow(x)-5),]
mean.abs.error <-( 1/(nrow(training_dd)-1) ) * sum(abs(training_dd[2:nrow(training_dd),"Value"] - training_dd[1:nrow(training_dd)-1,"Value"]))
res_ASE <- rep(NA,nrow(x))
step_ahead <- rep(NA,nrow(x))
for (t in 1:5){
res_ASE[nrow(training_dd)+t] <- abs(x[nrow(training_dd)+t,"Value"]-x[nrow(training_dd)+t,"Value_predicted"]) / mean.abs.error
step_ahead[nrow(training_dd)+t] <- t
}
#res_ASE[(nrow(x)-4):nrow(x)] <- sapply(1:5, function(t){abs(x[nrow(training_dd)+t,"Value"]-x[nrow(training_dd)+t,"Value_predicted"]) / mean.abs.error})
cbind.data.frame(x,ASE=res_ASE, step_ahead=step_ahead)
}
#out put with ASE estimate
prediction_ASE <- ddply(.data=prediction, c("ID_new","model_type"), get.ASE)
#save.image("/Volumes/Work MAC backup/Ward_et_al_results.RData")
Prepare data for visualization by re-ordering factor levels. For some of the models it is not entirely clear which of the different specifications was taken.
Visualize overall forecasting performance using the 13 models specified in the caption. The plot in the manuscript shows only 12 models! Currently it is hard to determine the exact grouping used (e.g. salmon, marine fish productivity) because there is no corresponding label shipped with the data. Nevertheless, the plots pretty much look like Figure 1 from the paper!
mean_ASE <- ddply(models_selected, .(Database, model_type, data_dist, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
ggplot(data=subset(mean_ASE, step_ahead<5 & mean_log_ASE<60 & (Database == "salmon" | Database == "RAMlegacy.recperssb")), aes(x=model_type,y=mean_log_ASE, group=step_ahead, linetype=as.factor(step_ahead), size=as.factor(step_ahead), shape=data_dist)) + geom_point(size=2) + geom_line() +
theme(axis.text.x = element_text(angle=90, vjust=0.5, size=16)) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid")) +
scale_size_manual(values=c(0.8,0.8,0.8,0.4)) + scale_shape_manual(values=c(2,1)) + facet_wrap(~Database)
# mean_ASE_salmon_all <- ddply(subset(models_selected, Species %in% c("Chinook", "Chum", "Pink", "Coho", "Sockeye")), .(model_type, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
#
# ggplot(data=subset(mean_ASE_salmon_all, step_ahead<5), aes(x=model_type,y=mean_log_ASE, group=as.factor(step_ahead), linetype=as.factor(step_ahead), size=as.factor(step_ahead))) + geom_point() + geom_line() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=16)) + scale_size_manual(values=c(0.8,0.8,0.8,0.4)) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid"))
Plot MASE for the different species of salmon separately. No “Coho” in the current dataset, hence no plot produced. Nevertheless, plots pretty similar to what was shown in the paper.
mean_ASE_salmon <- ddply(subset(models_selected, Species %in% c("Chinook", "Chum", "Pink", "Coho", "Sockeye")), .(Species, model_type, data_dist, step_ahead), summarize, mean_log_ASE = log(mean(ASE, na.rm = T)))
mean_ASE_salmon$Species <- factor(mean_ASE_salmon$Species ,levels(as.factor(mean_ASE_salmon$Species))[c(3,2,4,5,1)])
ggplot(data=subset(mean_ASE_salmon, step_ahead<5), aes(x=model_type,y=mean_log_ASE, group=as.factor(step_ahead), linetype=as.factor(step_ahead), size=as.factor(step_ahead), shape=data_dist)) + geom_point(size=2) + geom_point() + geom_line() + theme(axis.text.x = element_text(angle=90, vjust=0.5, size=16)) + facet_wrap(~Species) + scale_linetype_manual(values=c("solid", "dashed", "dotted", "solid")) + scale_size_manual(values=c(0.8,0.8,0.8,0.4)) + scale_shape_manual(values=c(2,1))
Based on meta information data set. This is pretty approximate, but ok as it was not the main target of the reproduction.
ggplot(subset(metainfo, Database %in% c("salmon" , "RAMlegacy.recperssb")), aes(x=training.acf)) + geom_histogram() + facet_wrap(~Database) + geom_vline(xintercept = 0, linetype="dashed", size=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.