Refs:
http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html
http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf
Time series differ from usual data series because they typically contain periodic patterns (weekly, yearly, …). Finding these patterns requires different types of analysis: instead of assuming that the order of the observations does not matter, we assume that it does, so older observations help predict new ones.
Time series \(X_t\) usually have two or three components
Trend component \(T_t\)
Seasonal component \(S_t\)
Irregular or residual component \(e_t\)
A time series with a seasonal component is called seasonal data (eg, sales per month). Some data series do not have a seasonal component (eg, average population mortality).
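In the additive form these components combine as a sum; when the size of the seasonal swings grows with the level of the series, a multiplicative form is usually preferred instead:
\[X_t = T_t + S_t + e_t \qquad \text{or} \qquad X_t = T_t \cdot S_t \cdot e_t\]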
R has a time-series class named ts:
my.data <- round( sin( (0:47 %% 12)+1 ) + runif(48) ,1) # some periodic data
# make a monthly time series from January 2009 to December 2014 (72 points; the 48 values above are recycled)
my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12)
my.ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009 1.7 1.7 0.3 -0.8 -0.9 -0.2 1.2 1.8 1.4 0.1 -0.2 -0.4
## 2010 0.9 1.9 0.8 -0.6 -0.2 0.0 0.7 1.0 1.3 -0.2 -0.9 0.3
## 2011 1.7 1.5 0.7 0.2 -0.9 -0.1 1.4 1.7 0.7 -0.2 -0.3 -0.5
## 2012 1.0 1.1 1.1 0.1 -0.4 0.6 1.6 1.9 1.4 0.2 -0.6 -0.5
## 2013 1.7 1.7 0.3 -0.8 -0.9 -0.2 1.2 1.8 1.4 0.1 -0.2 -0.4
## 2014 0.9 1.9 0.8 -0.6 -0.2 0.0 0.7 1.0 1.3 -0.2 -0.9 0.3
plot(my.ts)
ts(my.data, end=c(2009,1), frequency=4) # ts is smart and goes back in time if we only give the 'end' parameter
## Qtr1 Qtr2 Qtr3 Qtr4
## 1997 1.7 1.7 0.3
## 1998 -0.8 -0.9 -0.2 1.2
## 1999 1.8 1.4 0.1 -0.2
## 2000 -0.4 0.9 1.9 0.8
## 2001 -0.6 -0.2 0.0 0.7
## 2002 1.0 1.3 -0.2 -0.9
## 2003 0.3 1.7 1.5 0.7
## 2004 0.2 -0.9 -0.1 1.4
## 2005 1.7 0.7 -0.2 -0.3
## 2006 -0.5 1.0 1.1 1.1
## 2007 0.1 -0.4 0.6 1.6
## 2008 1.9 1.4 0.2 -0.6
## 2009 -0.5
Function window makes a subset of a time series:
plot( window(my.ts, start=c(2014, 6), end=c(2014, 12)) )
Other operations:
time(my.ts) # creates the vector of times at which a time series was sampled.
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009 2009 2009 2009 2009 2009 2009 2010 2010 2010 2010 2010 2010
## 2010 2010 2010 2010 2010 2010 2010 2010 2011 2011 2011 2011 2011
## 2011 2011 2011 2011 2011 2011 2011 2012 2012 2012 2012 2012 2012
## 2012 2012 2012 2012 2012 2012 2012 2012 2013 2013 2013 2013 2013
## 2013 2013 2013 2013 2013 2013 2013 2014 2014 2014 2014 2014 2014
## 2014 2014 2014 2014 2014 2014 2014 2014 2015 2015 2015 2015 2015
cycle(my.ts) # gives the positions in the cycle of each observation.
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009 1 2 3 4 5 6 7 8 9 10 11 12
## 2010 1 2 3 4 5 6 7 8 9 10 11 12
## 2011 1 2 3 4 5 6 7 8 9 10 11 12
## 2012 1 2 3 4 5 6 7 8 9 10 11 12
## 2013 1 2 3 4 5 6 7 8 9 10 11 12
## 2014 1 2 3 4 5 6 7 8 9 10 11 12
ts1 <- lag(my.ts, k=12) # lagged version of time series, shifted back k observations
plot(my.ts, lwd=2, main="comparison with next year")
points(ts1, type="l", col="red")
ds <- diff(my.ts, differences=1) # difference the time series ('differences' controls how many times)
plot( ds )
An example with real data, the daily quotes of the TUI AG stock (tui.csv):
tui <- read.csv("tui.csv", header=T, dec=",", sep=";")
head(tui)
## date open high low close volume
## 1 20020514 29.01 29.29 28.70 29.00 422606
## 2 20020513 28.81 29.15 28.71 28.93 265548
## 3 20020510 29.30 29.49 28.38 28.69 497579
## 4 20020509 29.26 29.78 28.82 29.01 342824
## 5 20020508 28.50 29.80 28.41 29.70 559890
## 6 20020507 29.50 29.65 27.85 28.21 716809
plot(tui[,5], type="l",lwd=2, col="red", xlab="time", ylab="closing values", main="Stock data of TUI AG", ylim=c(0,60) )
hist(diff(tui[,5]), prob=T, ylim=c(0,0.65), xlim=c(-4.5,4.5), col="grey", breaks=20, main="histogram of the differences")
lines(density(diff(tui[,5])),lwd=2)
points(seq(-4.5,4.5,len=100),dnorm(seq(-4.5,4.5,len=100), mean(diff(tui[,5])), sd(diff(tui[,5]))), col="red", type="l",lwd=2)
The Kolmogorov-Smirnov test checks whether a sample – in this case the differences of the logged closing values – follows a specific distribution:
ds <- diff(log(tui[,5]))
ks.test(ds, "pnorm", mean(ds), sd(ds)) # the small p-value below rejects normality
## Warning: ties should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: ds
## D = 0.098, p-value = 2.487e-05
## alternative hypothesis: two-sided
qqnorm(diff(tui[,5])); qqline(diff(tui[,5]), col="red") # a visual check of normality
shapiro.test(ds) # another normality test; again the small p-value rejects normality
##
## Shapiro-Wilk normality test
##
## data: ds
## W = 0.8096, p-value < 2.2e-16
A common method to extract the trend component \(T_t\) from a time series \(X_t\) is to apply a linear filter: \[T_t = \sum_{i=-\infty}^{+\infty} \lambda_i X_{t+i}\]
A simple choice is the moving average: \[T_t = \frac{1}{2a+1} \sum_{i=-a}^{a} X_{t+i}\]
In R this is done with filter:
a <- 20; tui.ma1 <- filter(tui[,5], filter=rep(1/a,a)) # check also package TTR about functions SMA et al.
a <- 50; tui.ma2 <- filter(tui[,5], filter=rep(1/a,a))
ts.plot(tui[,5], tui.ma1, tui.ma2, col=1:3, lwd=c(1,2,2))
Function stl performs a seasonal decomposition of \(X_t\): it determines the trend \(T_t\) using loess regression (locally weighted regression) and then computes the seasonal component \(S_t\) and the residuals \(e_t\) from the differences \(X_t - T_t\). An example:
beer <- read.csv("beer.csv", header=T, dec=",", sep=";")
beer <- ts(beer[,1],start=1956,freq=12)
plot(stl(log(beer), s.window="periodic"))
Let’s say we want to fit the beer time series with the following model:
\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t\]
logbeer <- log(beer)
t <- seq(1956, 1995.2, len=length(beer))
t2 <- t^2
model <- lm(logbeer ~ t + t2)
plot(logbeer)
lines(t, model$fit, col="red", lwd=2)
But we are not considering the seasonal component. So let’s improve the model by adding the first Fourier harmonics, \(\cos(2\pi t/P)\) and \(\sin(2\pi t/P)\), where \(P=12\) since the data is monthly:
\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \beta \cos\left(\frac{2\pi t}{12}\right) + \gamma \sin\left(\frac{2\pi t}{12}\right) + e_t\]
cos.t <- cos(2*pi*t) # t is measured in years, so cos(2*pi*t) already has a period of one year (P=12 months)
sin.t <- sin(2*pi*t)
model <- lm(logbeer ~ t + t2 + cos.t + sin.t)
plot(logbeer)
lines(t, model$fit, col="red", lwd=2)
We can check what coefficients are significant:
summary(model)
##
## Call:
## lm(formula = logbeer ~ t + t2 + cos.t + sin.t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3319 -0.0866 -0.0031 0.0818 0.3452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.83e+03 1.84e+02 -20.82 <2e-16 ***
## t 3.87e+00 1.86e-01 20.75 <2e-16 ***
## t2 -9.75e-04 4.72e-05 -20.66 <2e-16 ***
## cos.t -1.25e-02 7.67e-03 -1.62 0.1
## sin.t -1.08e-01 7.68e-03 -14.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.118 on 471 degrees of freedom
## Multiple R-squared: 0.802, Adjusted R-squared: 0.8
## F-statistic: 476 on 4 and 471 DF, p-value: <2e-16
The only coefficient that does not differ significantly from zero is the cosine component.
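Since the cosine term is not significant, we could drop it and refit; this is an extra step not in the original analysis, and model2 is just an illustrative name:
model2 <- lm(logbeer ~ t + t2 + sin.t) # refit without the non-significant cosine term
plot(logbeer)
lines(t, model2$fit, col="blue", lwd=2)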
One way to estimate the next value of a time series \(x_t\) is \[\hat{x}_t = \lambda_1 x_{t-1} + \lambda_2 x_{t-2} + \ldots\]
It seems reasonable to weight recent observations more heavily than older ones. One possibility is to use geometric weights: \[\lambda_i = \alpha(1-\alpha)^i;~~ 0 \lt \alpha \lt 1\]
This is called exponential smoothing and, in its basic form, should be used with time series that have no trend or seasonal components. The method has, however, been extended to Holt-Winters smoothing, which accepts time series with those components.
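As a sketch of the idea only (this is not the HoltWinters implementation, and alpha=0.2 is an arbitrary choice), the geometric weights are equivalent to the recursion \(s_t = \alpha x_t + (1-\alpha) s_{t-1}\):
exp.smooth <- function(x, alpha) {        # illustrative helper, not a library function
  s <- numeric(length(x))
  s[1] <- x[1]                            # start from the first observation
  for (i in 2:length(x))
    s[i] <- alpha*x[i] + (1-alpha)*s[i-1] # blend each observation with the previous estimate
  s
}
plot(as.numeric(beer), type="l")
lines(exp.smooth(as.numeric(beer), alpha=0.2), col="red", lwd=2)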
The HoltWinters function has three parameters, \(\alpha, \beta, \gamma\). With gamma=FALSE it assumes there is no seasonal component.
model <- HoltWinters(beer)
plot(beer)
lines(model$fitted[,"xhat"], col="red")
To predict the next values in the time series:
pred <- predict(model, n.ahead=24) # values for the next 24 months
plot(beer, xlim=c(1956,1997))
lines(pred, col="red", lty=2)
An example with the tui time series, which does not seem to have a seasonal component:
model <- HoltWinters(tui[,5], gamma=FALSE)
plot(tui[,5], type="l",xlim=c(0,length(tui[,5])+36))
lines(model$fitted[,"xhat"], col="red")
pred <- predict(model, n.ahead=36)
lines(pred, col="red", lty=2)
Function forecast adds confidence intervals to the predictions:
library(forecast)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: timeDate
## This is forecast 5.6
plot(forecast(model, h=40, level=c(75,95))) # the next 40 data points using 75% and 95% confidence intervals
There is also the function ets, which automatically fits an exponential smoothing model:
fit <- ets(my.ts) # Automated forecasting using an exponential model
plot(forecast(fit, level=c(75,95))) # using 75% and 95% confidence intervals
It’s possible to remove the trend component using decompose or stl:
report <- decompose(beer, type="multiplicative")
plot(beer / report$trend, main="signal without trend component") # multiplicative decomposition, so divide by the trend
plot(report$seasonal, main="signal without trend and irregular components")
report <- stl(beer, s.window="periodic")
plot(beer - report$time.series[,"trend"], type="l", main="signal without trend component")
plot(report$time.series[,"seasonal"], type="l", main="signal without trend and irregular components")
Those functions require a seasonal time series (frequency greater than one). For a series without one, it’s still possible to detrend using the residuals of a linear regression:
idxs <- 1:nrow(tui)
trend <- lm(tui[,"close"] ~ idxs)
plot(tui[,5], type="l")
abline(trend, col="red", lwd=2)
detrended <- trend$residuals
plot(detrended, type="l")
We can also use Fourier Analysis to find some pattern in the seasonal (plus irregular) component:
library(GeneCycle)
## Loading required package: MASS
## Loading required package: longitudinal
## Loading required package: corpcor
## Loading required package: fdrtool
f.data <- GeneCycle::periodogram(detrended)
harmonics <- 1:30 # find patterns up to one month
plot(f.data$freq[harmonics]*length(detrended),
f.data$spec[harmonics]/sum(f.data$spec),
xlab="Harmonics (Hz)", ylab="Amplitute Density", type="h")
ARIMA stands for AutoRegressive Integrated Moving Average. While exponential smoothing methods make no assumptions about the correlations between successive values of the time series, in some cases a better predictive model can be built by taking those correlations into account. ARIMA models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in it.
It consists of three stages:
Model Identification
Parameter Estimation
Diagnostic Checking
These stages are repeated until a ‘suitable’ model is identified.
ARIMA models have three parameters, ARIMA(\(p\),\(d\),\(q\)). Parameter \(p\) concerns the AR part (the autoregression), \(d\) the I part (the order of differencing, ie, integration), and \(q\) the MA part (the moving average).
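As an aside (not part of the original text), arima.sim from the stats package can simulate such processes, which helps build intuition for the three parameters; the coefficients below are arbitrary choices:
set.seed(101)  # arbitrary seed so the simulation is reproducible
sim <- arima.sim(model=list(order=c(1,1,1), ar=0.6, ma=0.4), n=200) # simulate an ARIMA(1,1,1)
plot(sim, main="simulated ARIMA(1,1,1)")
plot(diff(sim), main="after one difference: a stationary ARMA(1,1)")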
For this example the data is the age at death of 42 successive kings of England:
kings <- scan("kings.dat",skip=3)
plot(kings, type="l", main="Age of death of 42 successive kings of England")
ARIMA models are defined for stationary time series, ie, the statistical properties – mean and variance in our case – do not change over time. If you start off with a non-stationary time series, you will first need to ‘difference’ the time series until you obtain a stationary time series. If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
Box.test tests whether there is significant evidence for non-zero autocorrelations at lags 1 up to the given lag; the null hypothesis is that the observations are independent. Small p-values (ie, less than 0.05) therefore indicate that the differenced series still contains significant autocorrelation, structure that the ARMA part can later model; Box.test is not by itself a stationarity test, which is why the adf and kpss tests are also used below.
kings.dif <- diff(kings, d=1)
Box.test(kings.dif, lag=10)
##
## Box-Pierce test
##
## data: kings.dif
## X-squared = 11.32, df = 10, p-value = 0.3332
kings.dif <- diff(kings, d=2)
Box.test(kings.dif, lag=10)
##
## Box-Pierce test
##
## data: kings.dif
## X-squared = 16.39, df = 10, p-value = 0.08907
kings.dif <- diff(kings, d=3)
Box.test(kings.dif, lag=10)
##
## Box-Pierce test
##
## data: kings.dif
## X-squared = 21.91, df = 10, p-value = 0.01555
plot(kings.dif, type="l")
# other tests
library(fpp)
## Loading required package: fma
## Loading required package: tseries
##
## Attaching package: 'fma'
##
## The following object is masked _by_ '.GlobalEnv':
##
## beer
##
## The following objects are masked from 'package:MASS':
##
## cement, housing, petrol
##
## Loading required package: expsmooth
## Loading required package: lmtest
adf.test(kings.dif, alternative="stationary") # small p-values suggest the data is stationary
##
## Augmented Dickey-Fuller Test
##
## data: kings.dif
## Dickey-Fuller = -8.352, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(kings.dif) # small p-values suggest that the series is not stationary and a differencing is required
##
## KPSS Test for Level Stationarity
##
## data: kings.dif
## KPSS Level = 0.0354, Truncation lag parameter = 1, p-value = 0.1
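As a quick cross-check (an extra step, not in the original), the forecast package can also estimate the number of differences required directly:
library(forecast)
ndiffs(kings) # estimated order of differencing needed for stationarity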
So, in this case we will use \(d=3\) (the adf and kpss tests above support stationarity of the third differences), which means ARIMA(\(p\),3,\(q\)) models; note that auto.arima below ends up preferring \(d=1\).
The next step is to find appropriate values for \(p\) and \(q\). For this we start by plotting a correlogram and a partial correlogram of the time series, which is done with acf and pacf. To get the numeric values use the parameter plot=FALSE:
acf(kings.dif, lag.max=20)
acf(kings.dif, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'kings.dif', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 -0.656 0.186 -0.106 0.163 -0.034 -0.136 0.128 -0.020 0.025
## 10 11 12 13 14 15 16 17 18 19
## -0.147 0.187 -0.050 -0.098 0.071 -0.010 0.085 -0.160 0.082 0.024
## 20
## -0.015
pacf(kings.dif, lag.max=20)
pacf(kings.dif, lag.max=20, plot=FALSE)
##
## Partial autocorrelations of series 'kings.dif', by lag
##
## 1 2 3 4 5 6 7 8 9 10
## -0.656 -0.430 -0.460 -0.270 0.063 -0.015 -0.005 0.030 0.104 -0.097
## 11 12 13 14 15 16 17 18 19 20
## -0.027 0.115 -0.045 -0.081 -0.132 -0.004 0.023 -0.008 -0.011 -0.029
Only the correlation at lag \(1\) exceeds the significance bounds. The partial correlogram shows that the partial autocorrelations at lags 1, 2 and 3 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag.
Since the correlogram is zero after lag 1, and the partial correlogram tails off to zero after lag 3, this means that the following ARMA (autoregressive moving average) models are possible for the time series of third differences:
an ARMA(3,0) model, that is, an autoregressive model of order p=3, since the partial autocorrelogram is zero after lag 3, and the autocorrelogram tails off to zero (although perhaps too abruptly for this model to be appropriate)
an ARMA(0,1) model, that is, a moving average model of order q=1, since the autocorrelogram is zero after lag 1 and the partial autocorrelogram tails off to zero
an ARMA(p,q) model, that is, a mixed model with p and q greater than 0, since the autocorrelogram and partial correlogram tail off to zero (although the correlogram probably tails off to zero too abruptly for this model to be appropriate)
We use the principle of parsimony to decide which model is best: that is, we assume that the model with the fewest parameters is best. The ARMA(3,0) model has 3 parameters, the ARMA(0,1) model has 1 parameter, and the ARMA(p,q) model has at least 2 parameters. Therefore, the ARMA(0,1) model is taken as the best model.
Finally, the chosen model would be an ARIMA(0,3,1).
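As a cross-check of this manual choice (an extra step, not in the original), we can fit both candidates and compare their AICs; the orders correspond to ARMA(3,0) and ARMA(0,1) on the third differences:
arima(kings, order=c(3,3,0))$aic # the ARMA(3,0) candidate
arima(kings, order=c(0,3,1))$aic # the ARMA(0,1) candidate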
There is an R function (surprise!) that finds an appropriate ARIMA model:
library(forecast)
report <- auto.arima(kings) # the parameter ic="bic" penalises the number of parameters
report
## Series: kings
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.746 0.388
## s.e. 0.129 0.660
##
## sigma^2 estimated as 234: log likelihood=-165.8
## AIC=337.5 AICc=338.2 BIC=342.7
Knowing the parameters (here we follow the auto.arima suggestion) we use arima to fit the model to the original time series:
kings.arima <- arima(kings, order=c(0,1,1))
kings.arima
## Series: kings
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.722
## s.e. 0.121
##
## sigma^2 estimated as 230: log likelihood=-170.1
## AIC=344.1 AICc=344.4 BIC=347.6
kings.forecast <- forecast.Arima(kings.arima, h=5, level=c(.75, .90, .99)) # confidence levels
kings.forecast
## Point Forecast Lo 75 Hi 75 Lo 90 Hi 90 Lo 99 Hi 99
## 43 67.75 50.29 85.21 42.78 92.72 28.65 106.9
## 44 67.75 49.62 85.88 41.83 93.67 27.16 108.3
## 45 67.75 48.98 86.52 40.92 94.58 25.73 109.8
## 46 67.75 48.37 87.14 40.03 95.47 24.35 111.2
## 47 67.75 47.77 87.73 39.18 96.33 23.00 112.5
plot.forecast(kings.forecast)
It is a good idea to investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors.
acf(kings.forecast$residuals, lag=20) # seems ok
Box.test(kings.forecast$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: kings.forecast$residuals
## X-squared = 13.58, df = 20, p-value = 0.8509
Since the p-value for the Ljung-Box test is 0.85, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.
# investigate whether the forecast errors are normally distributed with mean zero and constant variance
plotForecastErrors <- function(forecasterrors)
{
# make a histogram of the forecast errors:
mybinsize <- IQR(forecasterrors)/4
mysd <- sd(forecasterrors)
mymin <- min(forecasterrors) - mysd*5
mymax <- max(forecasterrors) + mysd*3
# generate normally distributed data with mean 0 and standard deviation mysd
mynorm <- rnorm(10000, mean=0, sd=mysd)
mymin2 <- min(mynorm)
mymax2 <- max(mynorm)
if (mymin2 < mymin) { mymin <- mymin2 }
if (mymax2 > mymax) { mymax <- mymax2 }
# make a red histogram of the forecast errors, with the normally distributed data overlaid:
mybins <- seq(mymin, mymax + mybinsize, mybinsize) # extend one bin past mymax so hist() covers all values
hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
# freq=FALSE ensures the area under the histogram = 1
# generate normally distributed data with mean 0 and standard deviation mysd
myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
# plot the normal curve as a blue line on top of the histogram of forecast errors:
points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
plotForecastErrors(kings.forecast$residuals)
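We can also check directly that the forecast errors have a mean close to zero (an extra check, not in the original):
mean(kings.forecast$residuals) # should be close to zero
sd(kings.forecast$residuals)   # their spread, for reference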
Find an autoregressive model for the time series, and predict future observations:
my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12) # using the sine data
ar.model <- ar(my.ts)
pred <- predict(ar.model, n.ahead=12)
ts.plot(my.ts, pred$pred, lty=c(1:2), col=1:2)