Introduction

Time series differ from ordinary data series because they usually contain periodic patterns (weekly, yearly…). Finding these patterns requires different types of analysis: instead of assuming that the order of the observations does not matter, we assume that it does, so that old observations help predict new ones.

Time series \(X_t\) usually have two or three components: a trend component \(T_t\), possibly a seasonal component \(S_t\), and an irregular (residual) component \(e_t\).

A time series with a seasonal component is called seasonal data (e.g., sales per month). Some data series do not have a seasonal component (e.g., a population mortality average).

R has a time series class named ts:

my.data <- round( sin( (0:47 %% 12)+1 ) + runif(48)  ,1) # some periodic data
# make a monthly time series from January 2009 to December 2014 (the 48 values are recycled to fill the 72 months)
my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12)
my.ts
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2009  1.7  1.7  0.3 -0.8 -0.9 -0.2  1.2  1.8  1.4  0.1 -0.2 -0.4
## 2010  0.9  1.9  0.8 -0.6 -0.2  0.0  0.7  1.0  1.3 -0.2 -0.9  0.3
## 2011  1.7  1.5  0.7  0.2 -0.9 -0.1  1.4  1.7  0.7 -0.2 -0.3 -0.5
## 2012  1.0  1.1  1.1  0.1 -0.4  0.6  1.6  1.9  1.4  0.2 -0.6 -0.5
## 2013  1.7  1.7  0.3 -0.8 -0.9 -0.2  1.2  1.8  1.4  0.1 -0.2 -0.4
## 2014  0.9  1.9  0.8 -0.6 -0.2  0.0  0.7  1.0  1.3 -0.2 -0.9  0.3
plot(my.ts)


ts(my.data, end=c(2009,1), frequency=4) # ts is smart and goes back in time if we only give the 'end' parameter
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1997       1.7  1.7  0.3
## 1998 -0.8 -0.9 -0.2  1.2
## 1999  1.8  1.4  0.1 -0.2
## 2000 -0.4  0.9  1.9  0.8
## 2001 -0.6 -0.2  0.0  0.7
## 2002  1.0  1.3 -0.2 -0.9
## 2003  0.3  1.7  1.5  0.7
## 2004  0.2 -0.9 -0.1  1.4
## 2005  1.7  0.7 -0.2 -0.3
## 2006 -0.5  1.0  1.1  1.1
## 2007  0.1 -0.4  0.6  1.6
## 2008  1.9  1.4  0.2 -0.6
## 2009 -0.5

window extracts a subset of a time series:

plot( window(my.ts, start=c(2014, 6), end=c(2014, 12)) )


Other operations:

time(my.ts)                # creates the vector of times at which a time series was sampled.
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2009 2009 2009 2009 2009 2009 2009 2010 2010 2010 2010 2010 2010
## 2010 2010 2010 2010 2010 2010 2010 2010 2011 2011 2011 2011 2011
## 2011 2011 2011 2011 2011 2011 2011 2012 2012 2012 2012 2012 2012
## 2012 2012 2012 2012 2012 2012 2012 2012 2013 2013 2013 2013 2013
## 2013 2013 2013 2013 2013 2013 2013 2014 2014 2014 2014 2014 2014
## 2014 2014 2014 2014 2014 2014 2014 2014 2015 2015 2015 2015 2015
cycle(my.ts)               # gives the positions in the cycle of each observation.
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009   1   2   3   4   5   6   7   8   9  10  11  12
## 2010   1   2   3   4   5   6   7   8   9  10  11  12
## 2011   1   2   3   4   5   6   7   8   9  10  11  12
## 2012   1   2   3   4   5   6   7   8   9  10  11  12
## 2013   1   2   3   4   5   6   7   8   9  10  11  12
## 2014   1   2   3   4   5   6   7   8   9  10  11  12
ts1 <- lag(my.ts, k=12)    # lagged version of time series, shifted back k observations
plot(my.ts, lwd=2, main="comparison with next year")
points(ts1, type="l", col="red")


ds <- diff(my.ts, d=1)     # difference the time series d times ('d' matches the 'differences' argument)
plot( ds )               


Preliminary Analysis of Time Series

tui <- read.csv("tui.csv", header=T, dec=",", sep=";")
head(tui)
##       date  open  high   low close volume
## 1 20020514 29.01 29.29 28.70 29.00 422606
## 2 20020513 28.81 29.15 28.71 28.93 265548
## 3 20020510 29.30 29.49 28.38 28.69 497579
## 4 20020509 29.26 29.78 28.82 29.01 342824
## 5 20020508 28.50 29.80 28.41 29.70 559890
## 6 20020507 29.50 29.65 27.85 28.21 716809
plot(tui[,5], type="l",lwd=2, col="red", xlab="time", ylab="closing values", main="Stock data of TUI AG", ylim=c(0,60) )


hist(diff(tui[,5]),prob=T,ylim=c(0,0.65),xlim=c(-4.5,4.5),col="grey", breaks=20, main=" histogram of the differences")
lines(density(diff(tui[,5])),lwd=2)
points(seq(-4.5,4.5,len=100),dnorm(seq(-4.5,4.5,len=100), mean(diff(tui[,5])), sd(diff(tui[,5]))), col="red", type="l",lwd=2)


The Kolmogorov-Smirnov test checks whether a sample – in this case the differences of the log closing values of the time series – follows a given distribution:

ds <- diff(log(tui[,5]))
ks.test(ds, "pnorm", mean(ds), sd(ds))          # the small p-value below rejects normality
## Warning: ties should not be present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  ds
## D = 0.098, p-value = 2.487e-05
## alternative hypothesis: two-sided
qqnorm(diff(tui[,5])); qqline(diff(tui[,5]), col="red")  # another normality check, this time visual


shapiro.test(ds)  # another normality test: a p-value below 0.05 rejects normality (as happens here)
## 
##  Shapiro-Wilk normality test
## 
## data:  ds
## W = 0.8096, p-value < 2.2e-16

Linear Filtering

A common method to extract the trend component \(T_t\) from a time series \(X_t\) is to apply a linear filter: \[T_t = \sum_{i=-\infty}^{+\infty} \lambda_i X_{t+i}\]

A popular choice is the moving average: \[T_t = \frac{1}{2a+1} \sum_{i=-a}^{a} X_{t+i}\]

In R this is done with filter:

a <- 20; tui.ma1 <- filter(tui[,5], filter=rep(1/a,a)) # check also package TTR about functions SMA et al.
a <- 50; tui.ma2 <- filter(tui[,5], filter=rep(1/a,a)) 
ts.plot(tui[,5], tui.ma1, tui.ma2, col=1:3, lwd=c(1,2,2))
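
The moving-average formula above uses a centered window of width \(2a+1\). As a minimal sketch matching it literally (the window size a=5 is an arbitrary choice; sides=2, the default, centers the window on each observation):

a <- 5
tui.ma3 <- filter(tui[,5], filter=rep(1/(2*a+1), 2*a+1), sides=2) # average of X_t, the a values before and the a values after
ts.plot(tui[,5], tui.ma3, col=1:2, lwd=c(1,2))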


Function stl performs a seasonal decomposition of \(X_t\) by determining the trend \(T_t\) with a loess regression (locally weighted polynomial regression), and then computing the seasonal component \(S_t\) and the residuals \(e_t\) from the differences \(X_t-T_t\).

An example:

beer <- read.csv("beer.csv", header=T, dec=",", sep=";")
beer <- ts(beer[,1],start=1956,freq=12)
plot(stl(log(beer), s.window="periodic"))


Using Linear Regression

Let’s say we want to fit the beer time series with the following model:

\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t\]

logbeer <- log(beer)

t  <- seq(1956, 1995.2, len=length(beer))
t2 <- t^2
model <- lm(logbeer ~ t + t2)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)


But we are not considering the seasonal component. So let’s improve the model by adding the first Fourier harmonics, \(\cos(2\pi t/P)\) and \(\sin(2\pi t/P)\), where the period \(P\) is \(12\) since the data is monthly:

\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \beta \cos\left(\frac{2\pi t}{12}\right) + \gamma \sin\left(\frac{2\pi t}{12}\right) + e_t\]

cos.t <- cos(2*pi*t) # t is measured in years, so cos(2*pi*t) already has a period of one year (12 months)
sin.t <- sin(2*pi*t)
model <- lm(logbeer ~ t + t2 + cos.t + sin.t)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)


We can check which coefficients are significant:

summary(model)
## 
## Call:
## lm(formula = logbeer ~ t + t2 + cos.t + sin.t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3319 -0.0866 -0.0031  0.0818  0.3452 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.83e+03   1.84e+02  -20.82   <2e-16 ***
## t            3.87e+00   1.86e-01   20.75   <2e-16 ***
## t2          -9.75e-04   4.72e-05  -20.66   <2e-16 ***
## cos.t       -1.25e-02   7.67e-03   -1.62      0.1    
## sin.t       -1.08e-01   7.68e-03  -14.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.118 on 471 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:   0.8 
## F-statistic:  476 on 4 and 471 DF,  p-value: <2e-16

The only coefficient that does not seem to differ significantly from zero is the cosine term.
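
A quick way to confirm this is to refit the model without the cosine term and compare both fits with anova (a sketch only, reusing the variables defined above):

model.nocos <- lm(logbeer ~ t + t2 + sin.t)  # same model without the cosine term
anova(model.nocos, model)                    # a large p-value means the cosine term adds little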

Exponential Smoothing

One way to estimate the next value of a time series \(x_t\) is \[\hat{x}_t = \lambda_1 x_{t-1} + \lambda_2 x_{t-2} + \ldots\]

It seems reasonable to weight recent observations more heavily than older ones. One possibility is to use geometric weights: \[\lambda_i = \alpha(1-\alpha)^i;~~ 0 \lt \alpha \lt 1\]
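
With these weights the estimate can be computed recursively, mixing the latest observation with the previous estimate. A minimal sketch of this basic smoother on the sine data from above, with an arbitrary \(\alpha = 0.3\):

alpha <- 0.3                  # arbitrary smoothing parameter, for illustration only
x     <- as.numeric(my.ts)
x.hat <- numeric(length(x))
x.hat[1] <- x[1]              # initialise the smoother with the first observation
for (i in 2:length(x))
  x.hat[i] <- alpha*x[i-1] + (1-alpha)*x.hat[i-1]  # corresponds to the geometric weights above
plot(x, type="l")
lines(x.hat, col="red")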

This is called exponential smoothing and, in its basic form, should be used with time series that have no trend or seasonal components. The method has, however, been extended to Holt-Winters smoothing, which does accept time series with those components.

Holt-Winters smoothing has three parameters, \(\alpha, \beta, \gamma\). When the R function HoltWinters is called with gamma=FALSE it assumes there is no seasonal component.

model <- HoltWinters(beer)
plot(beer)
lines(model$fitted[,"xhat"], col="red")


To predict the next values in the time series:

pred <- predict(model, n.ahead=24) # values for the next 24 months
plot(beer, xlim=c(1956,1997))
lines(pred, col="red", lty=2)


An example with the tui time series, which does not seem to have a seasonal component:

model <- HoltWinters(tui[,5], gamma=FALSE)
plot(tui[,5], type="l",xlim=c(0,length(tui[,5])+36))
lines(model$fitted[,"xhat"], col="red")
pred <- predict(model, n.ahead=36)
lines(pred, col="red", lty=2)


Function forecast adds confidence intervals to the prediction:

library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 5.6
plot(forecast(model, h=40, level=c(75,95))) # the next 40 data points using 75% and 95% confidence intervals


There is also the function ets, which automatically fits an exponential smoothing model:

fit <- ets(my.ts) # Automated forecasting using an exponential model
plot(forecast(fit, level=c(75,95))) # using 75% and 95% confidence intervals


Detrending

It’s possible to remove the trend component using decompose or stl:

report <- decompose(beer, type="multiplicative")
# note: with a multiplicative decomposition, dividing by the trend (beer / report$trend) would be the more natural detrending
plot(beer - report$trend, main="signal without trend component")


plot(report$seasonal, main="signal without trend and irregular components")


report <- stl(beer, s.window="periodic")
plot(beer - report$time.series[,"trend"], type="l", main="signal without trend component")


plot(report$time.series[,"seasonal"], type="l", main="signal without trend and irregular components")


Those functions only work when there is a detectable seasonal signal. It’s still possible to detrend a series using the residuals of a linear regression:

idxs  <- 1:nrow(tui)
trend <- lm(tui[,"close"] ~ idxs)
plot(tui[,5], type="l")
abline(trend, col="red", lwd=2)


detrended <- trend$residuals
plot(detrended, type="l")


We can also use Fourier Analysis to find some pattern in the seasonal (plus irregular) component:

library(GeneCycle)
## Loading required package: MASS
## Loading required package: longitudinal
## Loading required package: corpcor
## Loading required package: fdrtool
f.data <- GeneCycle::periodogram(detrended)
harmonics <- 1:30   # find patterns up to one month
plot(f.data$freq[harmonics]*length(detrended), 
     f.data$spec[harmonics]/sum(f.data$spec), 
     xlab="Harmonics (Hz)", ylab="Amplitute Density", type="h")


ARIMA models

ARIMA stands for AutoRegressive Integrated Moving Average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can make a better predictive model by taking those correlations into account. ARIMA models include an explicit statistical model for the irregular component of a time series, which allows for non-zero autocorrelations in the irregular component.

Fitting an ARIMA model consists of three stages:

  1. Model Identification

  2. Parameter Estimation

  3. Diagnostic Checking

These stages are repeated until a ‘suitable’ model is identified.

ARIMA models have three parameters, ARIMA(\(p\),\(d\),\(q\)). Parameter \(p\) concerns the AR part (the autoregression), parameter \(d\) the I part (the order of differencing), and \(q\) the MA part (the moving average).
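
To get a feel for what each parameter controls, we can simulate simple instances with arima.sim; a quick sketch (the coefficients 0.7 and -0.5, the sample size and the seed are arbitrary choices):

set.seed(101)                                         # for reproducibility
x.ar <- arima.sim(model=list(ar=0.7), n=200)          # ARIMA(1,0,0): pure AR(1)
x.ma <- arima.sim(model=list(ma=-0.5), n=200)         # ARIMA(0,0,1): pure MA(1)
x.i  <- arima.sim(model=list(order=c(0,1,0)), n=200)  # ARIMA(0,1,0): a random walk (d=1)
par(mfrow=c(3,1)); plot(x.ar); plot(x.ma); plot(x.i)
par(mfrow=c(1,1))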

For this example the data are the ages at death of 42 successive kings of England:

kings <- scan("kings.dat",skip=3)
plot(kings, type="l", main="Age of death of 42 successive kings of England")


ARIMA models are defined for stationary time series, i.e., series whose statistical properties – mean and variance in our case – do not change over time. If you start off with a non-stationary time series, you will first need to ‘difference’ it until you obtain a stationary series. If you have to difference the time series \(d\) times to obtain a stationary series, then you have an ARIMA(\(p\),\(d\),\(q\)) model, where \(d\) is the order of differencing used.
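
As an aside, the forecast package provides the helper ndiffs, which estimates how many differences are needed to reach stationarity using a unit-root test; a quick sketch:

library(forecast)
ndiffs(kings)   # estimated order of differencing d needed for stationarity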

Box.test tests whether there is significant evidence for non-zero autocorrelations at the first lags (here, up to lag 10). Small p-values (i.e., less than 0.05) indicate that significant autocorrelation remains in the differenced series.

kings.dif <- diff(kings, d=1)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 11.32, df = 10, p-value = 0.3332
kings.dif <- diff(kings, d=2)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 16.39, df = 10, p-value = 0.08907
kings.dif <- diff(kings, d=3)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 21.91, df = 10, p-value = 0.01555
plot(kings.dif, type="l")


# other tests
library(fpp)
## Loading required package: fma
## Loading required package: tseries
## 
## Attaching package: 'fma'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     beer
## 
## The following objects are masked from 'package:MASS':
## 
##     cement, housing, petrol
## 
## Loading required package: expsmooth
## Loading required package: lmtest
adf.test(kings.dif, alternative="stationary") # small p-values suggest the data is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  kings.dif
## Dickey-Fuller = -8.352, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(kings.dif) # small p-values suggest that the series is not stationary and a differencing is required
## 
##  KPSS Test for Level Stationarity
## 
## data:  kings.dif
## KPSS Level = 0.0354, Truncation lag parameter = 1, p-value = 0.1

So, in this case we are going to use \(d=3\), which means ARIMA(\(p\),3,\(q\)) models.

The next step is to find appropriate values for \(p\) and \(q\). For this we start by plotting the correlogram and partial correlogram of the time series, which is done with acf and pacf. To get the values instead of the plots, use the parameter plot=FALSE:

acf(kings.dif,  lag.max=20)


acf(kings.dif,  lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'kings.dif', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000 -0.656  0.186 -0.106  0.163 -0.034 -0.136  0.128 -0.020  0.025 
##     10     11     12     13     14     15     16     17     18     19 
## -0.147  0.187 -0.050 -0.098  0.071 -0.010  0.085 -0.160  0.082  0.024 
##     20 
## -0.015
pacf(kings.dif, lag.max=20)


pacf(kings.dif, lag.max=20, plot=FALSE)
## 
## Partial autocorrelations of series 'kings.dif', by lag
## 
##      1      2      3      4      5      6      7      8      9     10 
## -0.656 -0.430 -0.460 -0.270  0.063 -0.015 -0.005  0.030  0.104 -0.097 
##     11     12     13     14     15     16     17     18     19     20 
## -0.027  0.115 -0.045 -0.081 -0.132 -0.004  0.023 -0.008 -0.011 -0.029

Only the correlation at lag \(1\) exceeds the significance bounds. The partial correlogram shows that the partial autocorrelations at lags 1, 2 and 3 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag.

Since the correlogram is zero after lag 1, and the partial correlogram tails off to zero after lag 3, the following ARMA (autoregressive moving average) models are possible for the time series of third differences:

  - an ARMA(3,0) model, i.e., an autoregressive model of order \(p=3\)

  - an ARMA(0,1) model, i.e., a moving average model of order \(q=1\)

  - an ARMA(p,q) model, with both \(p\) and \(q\) greater than zero

We use the principle of parsimony to decide which model is best: that is, we assume that the model with the fewest parameters is best. The ARMA(3,0) model has 3 parameters, the ARMA(0,1) model has 1 parameter, and the ARMA(p,q) model has at least 2 parameters. Therefore, the ARMA(0,1) model is taken as the best model.

Finally, the chosen model will be an ARIMA(0,3,1).
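
As a rough sanity check of this manual choice, we can fit the candidate models with arima and compare their AICs; a sketch only (the automatic procedure below may still prefer a different model):

fit.ar <- arima(kings, order=c(3,3,0))   # the ARMA(3,0) candidate on the third differences
fit.ma <- arima(kings, order=c(0,3,1))   # the ARMA(0,1) candidate on the third differences
AIC(fit.ar, fit.ma)                      # the model with the smaller AIC is preferred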

There is an R function (surprise!) that finds an appropriate ARIMA model:

library(forecast)
report <- auto.arima(kings) # parameter ic="bic" penalises the number of parameters
report
## Series: kings 
## ARIMA(0,1,1) with drift         
## 
## Coefficients:
##          ma1  drift
##       -0.746  0.388
## s.e.   0.129  0.660
## 
## sigma^2 estimated as 234:  log likelihood=-165.8
## AIC=337.5   AICc=338.2   BIC=342.7

Knowing the parameters we use arima to fit this model to the original time series:

kings.arima <- arima(kings, order=c(0,1,1))
kings.arima
## Series: kings 
## ARIMA(0,1,1)                    
## 
## Coefficients:
##          ma1
##       -0.722
## s.e.   0.121
## 
## sigma^2 estimated as 230:  log likelihood=-170.1
## AIC=344.1   AICc=344.4   BIC=347.6
kings.forecast <- forecast.Arima(kings.arima, h=5, level=c(.75, .90, .99))  # confidence levels
kings.forecast
##    Point Forecast Lo 75 Hi 75 Lo 90 Hi 90 Lo 99 Hi 99
## 43          67.75 50.29 85.21 42.78 92.72 28.65 106.9
## 44          67.75 49.62 85.88 41.83 93.67 27.16 108.3
## 45          67.75 48.98 86.52 40.92 94.58 25.73 109.8
## 46          67.75 48.37 87.14 40.03 95.47 24.35 111.2
## 47          67.75 47.77 87.73 39.18 96.33 23.00 112.5
plot.forecast(kings.forecast)


It is a good idea to investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors.

acf(kings.forecast$residuals, lag=20) # seems ok


Box.test(kings.forecast$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  kings.forecast$residuals
## X-squared = 13.58, df = 20, p-value = 0.8509

Since the p-value for the Ljung-Box test is 0.85, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.

# investigate whether the forecast errors are normally distributed with mean zero and constant variance
plotForecastErrors <- function(forecasterrors)
  {
     # make a histogram of the forecast errors:
     mybinsize <- IQR(forecasterrors)/4
     mysd   <- sd(forecasterrors)
     mymin  <- min(forecasterrors) - mysd*5
     mymax  <- max(forecasterrors) + mysd*3
     # generate normally distributed data with mean 0 and standard deviation mysd
     mynorm <- rnorm(10000, mean=0, sd=mysd)
     mymin2 <- min(mynorm)
     mymax2 <- max(mynorm)
     if (mymin2 < mymin) { mymin <- mymin2 }
     if (mymax2 > mymax) { mymax <- mymax2 }
     # make a red histogram of the forecast errors, with the normally distributed data overlaid:
     mybins <- seq(mymin, mymax, mybinsize)
     hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
     # freq=FALSE ensures the area under the histogram = 1
     # generate normally distributed data with mean 0 and standard deviation mysd
     myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
     # plot the normal curve as a blue line on top of the histogram of forecast errors:
     points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
  }

plotForecastErrors(kings.forecast$residuals)


Autoregressive model

Find an autoregressive model for the time series, and predict future observations:

my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12) # using the sine data
ar.model <- ar(my.ts)       
pred <- predict(ar.model, n.ahead=12)
ts.plot(my.ts, pred$pred, lty=c(1:2), col=1:2)
