15 ARIMA and SARIMA Models

1 ARMA(p, q) Model Rewrite

Recall the ARMA model and its backshift notation. We rewrite $\phi(B)(y_t-\mu)=\theta(B)\varepsilon_t$ as $\phi(B)y_t=\delta+\theta(B)\varepsilon_t$ and try to determine $\delta$. We first write
$$y_t-\mu=\frac{\theta(B)}{\phi(B)}\varepsilon_t.$$
Factorize $\phi(z)$ as $\phi(z)=(1-a_1z)\cdots(1-a_pz)$. Here $a_1^{-1},\dots,a_p^{-1}$ are the roots of $\phi(z)$. Then
$$y_t-\mu=\frac{\theta(B)}{\prod_{k=1}^p(1-a_kB)}\varepsilon_t=\theta(B)(1-a_1B)^{-1}\cdots(1-a_pB)^{-1}\varepsilon_t.$$
Recalling the expansion in (2.4), if all $|a_k|<1$ we can rewrite
$$y_t=\mu+\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}.$$
This is a causal stationary process.
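To make the expansion concrete, the $\psi$-weights can be computed by matching coefficients in $\phi(z)\psi(z)=\theta(z)$, which gives the recursion $\psi_j=\theta_j+\sum_k\phi_k\psi_{j-k}$. A minimal numpy sketch (the function name `psi_weights` is my own):

```python
import numpy as np

def psi_weights(phi, theta, m):
    """First m coefficients psi_0, ..., psi_{m-1} of the causal expansion
    y_t = mu + sum_j psi_j eps_{t-j}, found by matching coefficients in
    phi(z) psi(z) = theta(z): psi_j = theta_j + sum_k phi_k psi_{j-k}."""
    phi = np.asarray(phi, dtype=float)      # AR coefficients phi_1..phi_p
    theta = np.asarray(theta, dtype=float)  # MA coefficients theta_1..theta_q
    psi = np.zeros(m)
    psi[0] = 1.0
    for j in range(1, m):
        val = theta[j - 1] if j <= len(theta) else 0.0
        for k in range(1, min(j, len(phi)) + 1):
            val += phi[k - 1] * psi[j - k]
        psi[j] = val
    return psi

# AR(1) with phi_1 = 0.7: psi_j = 0.7**j, decaying geometrically
print(psi_weights([0.7], [], 5))   # [1.0, 0.7, 0.49, 0.343, 0.2401]
```

For an MA($q$) model the weights cut off after lag $q$; for any causal ARMA they decay geometrically, which is what makes the process stationary.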

Since neither the ACF nor the PACF of an ARMA$(p,q)$ process has a clear cutoff, we usually use AIC or BIC to determine $p$ and $q$.

2 Box-Jenkins Time Series Modeling Strategy

The idea is:

1. Transform the data into an (approximately) stationary time series.
2. Fit an ARMA model to the transformed series.

To implement the first idea, we usually have two ways: differencing the series, or estimating trend/seasonal components explicitly and removing them.

3 ARIMA Models

ARIMA (AutoRegressive Integrated Moving Average)

$\{y_t\}$ is ARIMA$(p,d,q)$ if $\phi(B)(\nabla^dy_t-\mu)=\theta(B)\varepsilon_t$, where $\varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2)$ and $\nabla^d=(1-B)^d$ is the $d$-th order differencing operator.
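As a quick check on the role of $\nabla$: applying the difference operator $d$ times removes a polynomial trend of degree $d$ exactly. A minimal numpy sketch with a quadratic trend:

```python
import numpy as np

# Differencing operator: nabla y_t = (1 - B) y_t = y_t - y_{t-1}.
t = np.arange(20, dtype=float)
y = 3.0 * t**2 + 2.0 * t + 1.0   # deterministic quadratic trend

d1 = np.diff(y)                  # nabla y_t: linear in t (6t - 1)
d2 = np.diff(y, n=2)             # nabla^2 y_t: constant 2 * 3.0 = 6.0
print(d2)
```

In practice $y_t$ is a quadratic trend plus a stationary disturbance, and $\nabla^2y_t$ is then the (differenced) stationary part plus the constant, to which an ARMA model can be fitted.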

3.1 Seasonal ARMA Models

We say $\{y_t\}$ is a seasonal ARMA$(P,Q)$ with period $s$ if $\Phi(B^s)(y_t-\mu)=\Theta(B^s)\varepsilon_t$, where
$$\Phi(B^s)=1-\Phi_1B^s-\cdots-\Phi_PB^{Ps},\qquad\Theta(B^s)=1+\Theta_1B^s+\cdots+\Theta_QB^{Qs}.$$
This is a special case of an ARMA$(Ps,Qs)$ model, but it has only $P+Q+1$ parameters (one of them $\sigma^2$), while a general ARMA$(Ps,Qs)$ has $Ps+Qs+1$.

The ACF and PACF of a seasonal ARMA process are non-zero only at the seasonal lags $h=0,s,2s,\dots$. At those lags, the ACF and PACF behave just like those of the non-seasonal ARMA model $\Phi(B)X_t=\Theta(B)\varepsilon_t$.
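This can be verified from the theoretical ACF. For a seasonal MA(1) with period $s=12$, $y_t=\varepsilon_t+\Theta\varepsilon_{t-12}$, the $\psi$-weights are non-zero only at lags $0$ and $12$, so $\gamma(h)=\sigma^2\sum_j\psi_j\psi_{j+h}$ vanishes except at $h=0,12$. A numpy sketch (parameter value is illustrative):

```python
import numpy as np

def ma_acf(psi, max_lag):
    """ACF of an MA process y_t = sum_j psi_j eps_{t-j},
    from gamma(h) = sigma^2 * sum_j psi_j psi_{j+h}."""
    psi = np.asarray(psi, dtype=float)
    gamma = np.array([psi[:len(psi) - h] @ psi[h:] for h in range(max_lag + 1)])
    return gamma / gamma[0]

# Seasonal MA(1), period 12: y_t = eps_t + 0.6 eps_{t-12}
Theta = 0.6
psi = np.zeros(13)
psi[0], psi[12] = 1.0, Theta
rho = ma_acf(psi, 13)
# rho vanishes at lags 1..11 and 13; rho[12] = Theta / (1 + Theta^2),
# exactly the lag-1 ACF of a non-seasonal MA(1) moved to lag 12
```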

3.2 Multiplicative Seasonal ARMA Models

Multiplicative Seasonal ARMA Model

ARMA$(p,q)\times(P,Q)_s$: $\Phi(B^s)\phi(B)(y_t-\mu)=\Theta(B^s)\theta(B)\varepsilon_t$.

For a dataset whose sample autocorrelations are non-negligible at lags $0,1,11,12,13$ (like the co2 dataset), we can use this structure to reduce the number of parameters.
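To see why the multiplicative form matches that pattern, multiply out the MA polynomials: $(1+\Theta B^{12})(1+\theta B)$ has non-zero coefficients at lags $0,1,12,13$, and the resulting ACF is non-zero exactly at lags $1,11,12,13$. A numpy sketch with illustrative parameter values:

```python
import numpy as np

# Multiplicative MA part: Theta(B^12) * theta(B) = (1 + 0.4 B^12)(1 + 0.5 B)
theta_poly = np.array([1.0, 0.5])                 # theta(B), coefficients by lag
Theta_poly = np.zeros(13)
Theta_poly[0], Theta_poly[12] = 1.0, 0.4          # Theta(B^12)
psi = np.convolve(theta_poly, Theta_poly)         # non-zero at lags 0, 1, 12, 13

# ACF of the resulting MA(13): gamma(h) = sigma^2 sum_j psi_j psi_{j+h}
gamma = np.array([psi[:len(psi) - h] @ psi[h:] for h in range(15)])
rho = gamma / gamma[0]
# rho is non-zero exactly at lags 1, 11, 12, 13 -- the co2-style pattern,
# captured with only two MA parameters instead of thirteen
```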

4 SARIMA Models

SARIMA model

ARIMA$(p,d,q)\times(P,D,Q)_s$: $\Phi(B^s)\phi(B)\nabla_s^D\nabla^d(y_t-\mu)=\delta+\Theta(B^s)\theta(B)\varepsilon_t$. Recall $\nabla_s^D=(1-B^s)^D$ and $\nabla^d=(1-B)^d$.
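To see the two differencing operators working together, here is a small numpy sketch on a synthetic series (made up for illustration) with a linear trend plus a period-12 cycle; $\nabla$ removes the trend and $\nabla_{12}$ removes the remaining periodic component:

```python
import numpy as np

# Synthetic series: linear trend + exactly periodic seasonal pattern
t = np.arange(60)
seasonal = np.tile(np.sin(2 * np.pi * np.arange(12) / 12), 5)
y = 0.3 * t + 10.0 + 5.0 * seasonal

dy = y[1:] - y[:-1]          # nabla y_t: constant + seasonal increments
ddy = dy[12:] - dy[:-12]     # nabla_12 nabla y_t: identically zero here
print(np.max(np.abs(ddy)))
```

With noisy data, $\nabla_{12}\nabla y_t$ would instead be a stationary series to which the ARMA$(p,q)\times(P,Q)_{12}$ part is fitted.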

5 Parameter Estimation in MA(1)

Estimating the parameters of ARMA/ARIMA/SARIMA models is much harder than for AR models. We will illustrate the difficulty using the example of MA(1). Recall that the MA(1) model is given by
$$y_t=\mu+\varepsilon_t+\theta\varepsilon_{t-1}=\mu+\theta(B)\varepsilon_t,\qquad\theta(B)=1+\theta B,\qquad\varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2).\tag{5.1}$$
The joint density of $y_1,\dots,y_n$ is multivariate normal with mean $m=(\mu,\dots,\mu)^T$ and covariance matrix $\Sigma$ given by
$$\Sigma(i,j)=\begin{cases}\sigma^2(1+\theta^2),&i=j,\\\sigma^2\theta,&|i-j|=1,\\0,&\text{otherwise}.\end{cases}$$
The likelihood is
$$\left(\frac{1}{\sqrt{2\pi}}\right)^n(\det\Sigma)^{-\frac12}\exp\left(-\frac12(y-m)^T\Sigma^{-1}(y-m)\right),$$
where $y=(y_1,\dots,y_n)^T$. This is a function of $\mu,\theta,\sigma$, which can be estimated by maximizing the logarithm of the likelihood. However, computing $\Sigma^{-1}$ makes this expensive, so we look for an approach that avoids the matrix inverse.
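As an aside, for MA(1) the tridiagonal structure of $\Sigma$ already lets us evaluate the exact Gaussian log-likelihood without forming $\Sigma^{-1}$, via a banded Cholesky factorization. A scipy sketch (the function name `ma1_loglik` is my own):

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

def ma1_loglik(y, mu, theta, sigma):
    """Exact Gaussian log-likelihood of MA(1) using the tridiagonal
    structure of Sigma (banded Cholesky; no dense inverse is formed)."""
    n = len(y)
    # Sigma in upper banded storage: row 0 = superdiagonal, row 1 = diagonal
    ab = np.zeros((2, n))
    ab[1, :] = sigma**2 * (1 + theta**2)
    ab[0, 1:] = sigma**2 * theta
    c = cholesky_banded(ab)                       # banded Cholesky factor
    r = y - mu
    quad = r @ cho_solve_banded((c, False), r)    # r^T Sigma^{-1} r
    logdet = 2.0 * np.sum(np.log(c[1, :]))        # log det Sigma from the factor
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

This costs $O(n)$ rather than the $O(n^3)$ of a dense solve, but it does not generalize easily beyond low-order MA models, which motivates the AR-based approximation below.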
An alternative approach is to find a connection to AR models. We can invert the MA polynomial,
$$\varepsilon_t=\frac{1}{\theta(B)}(y_t-\mu)=(1-\theta B+\theta^2B^2-\theta^3B^3+\cdots)(y_t-\mu),$$
so that
$$y_t-\theta y_{t-1}+\theta^2y_{t-2}-\theta^3y_{t-3}+\cdots=\frac{\mu}{1+\theta}+\varepsilon_t.$$
This requires $|\theta|<1$. For this AR model, the likelihood is
$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n\exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^n\left(y_t-\frac{\mu}{1+\theta}-\theta y_{t-1}+\theta^2y_{t-2}-\theta^3y_{t-3}+\cdots\right)^2\right).$$
This involves $y_0,y_{-1},y_{-2},\dots$, for which we have no data. We can simply set them to $0$. The likelihood then becomes
$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n\exp\left(-\frac{S(\mu,\theta)}{2\sigma^2}\right),$$
where
$$S(\mu,\theta)=\left(y_1-\frac{\mu}{1+\theta}\right)^2+\left(y_2-\frac{\mu}{1+\theta}-\theta y_1\right)^2+\cdots+\left(y_n-\frac{\mu}{1+\theta}-\theta y_{n-1}+\theta^2y_{n-2}-\cdots+(-1)^{n-1}\theta^{n-1}y_1\right)^2.$$
The MLE of $\mu,\theta$ comes from
$$\min_{\hat\mu,\hat\theta}S(\mu,\theta).$$
This is a nonlinear minimization that can be done with packages in Python such as scipy. It is easy to see that $\hat\sigma=\sqrt{S(\hat\mu,\hat\theta)/n}$.
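A sketch of this minimization with `scipy.optimize`. The recursion $p_t=y_t-\theta p_{t-1}$ (with $p_0=0$) accumulates the truncated series $y_t-\theta y_{t-1}+\theta^2y_{t-2}-\cdots$, so each residual is $e_t=p_t-\mu/(1+\theta)$; the simulated data and starting point are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def S(params, y):
    """Truncated sum of squares S(mu, theta) for MA(1),
    with pre-sample values y_0, y_{-1}, ... set to 0 as in the text."""
    mu, theta = params
    e = np.empty(len(y))
    p = 0.0                          # p_t = y_t - theta * p_{t-1}
    for t in range(len(y)):
        p = y[t] - theta * p
        e[t] = p - mu / (1.0 + theta)
    return np.sum(e**2)

# Simulated MA(1) data: mu = 1, theta = 0.5, sigma = 1 (illustrative only)
rng = np.random.default_rng(0)
eps = rng.standard_normal(501)
y = 1.0 + eps[1:] + 0.5 * eps[:-1]

res = minimize(S, x0=[0.0, 0.0], args=(y,), method="Nelder-Mead")
mu_hat, theta_hat = res.x
sigma_hat = np.sqrt(res.fun / len(y))
```

In practice one should keep the search inside $|\theta|<1$ (e.g. via bounds or a reparametrization), since $S$ blows up as $\theta\to-1$.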

For uncertainty quantification, we can take a Bayesian approach. First assume the priors $\theta\sim\mathrm{Uniform}(-1,1)$, $\mu\sim\mathrm{Uniform}(-C,C)$, and $\log\sigma\sim\mathrm{Uniform}(-C,C)$ for a large $C$. Note that we restrict $|\theta|<1$.
The posterior is then
$$f_{\mu,\theta,\sigma\mid\text{data}}(\mu,\theta,\sigma)\propto\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n\exp\left(-\frac{S(\mu,\theta)}{2\sigma^2}\right)\times\frac{1}{\sigma}\mathbf{1}\{-1<\theta<1,\,-C<\mu,\log\sigma<C\}\propto\sigma^{-n-1}\exp\left(-\frac{S(\mu,\theta)}{2\sigma^2}\right)\mathbf{1}\{-1<\theta<1,\,-C<\mu,\log\sigma<C\}.$$
To obtain the posterior of $\mu,\theta$ alone, we integrate the above with respect to $\sigma$. Then we have
$$f_{\mu,\theta\mid\text{data}}(\mu,\theta)\propto\left(\frac{1}{S(\mu,\theta)}\right)^{n/2}\mathbf{1}\{-1<\theta<1,\,-C<\mu<C\}.$$
This can be evaluated numerically over a grid of $\mu,\theta$ and then normalized. Alternatively, we can approximate it with a suitable $t$ distribution by doing a Taylor expansion of $S(\mu,\theta)$ near $\hat\mu,\hat\theta$. I.e., let $\alpha=(\mu,\theta)$ and $\hat\alpha=(\hat\mu,\hat\theta)$:
$$S(\alpha)\approx S(\hat\alpha)+\langle\nabla S(\hat\alpha),\alpha-\hat\alpha\rangle+(\alpha-\hat\alpha)^T\left(\tfrac12H_S(\hat\alpha)\right)(\alpha-\hat\alpha)=S(\hat\alpha)+(\alpha-\hat\alpha)^T\left(\tfrac12H_S(\hat\alpha)\right)(\alpha-\hat\alpha),$$
where we used $\nabla S(\hat\alpha)=0$ because $\hat\alpha$ is a minimizer, and $H_S(\hat\alpha)$ is the Hessian of $S$. Therefore, writing $\mathbf{1}\{\cdots\}$ for the indicator $\mathbf{1}\{-1<\theta<1,\,-C<\mu<C\}$,
$$f_{\mu,\theta\mid\text{data}}(\mu,\theta)\propto\left(\frac{1}{S(\mu,\theta)}\right)^{n/2}\mathbf{1}\{\cdots\}\propto\left(\frac{S(\hat\alpha)}{S(\alpha)}\right)^{n/2}\mathbf{1}\{\cdots\}$$
$$\approx\left(\frac{S(\hat\alpha)}{S(\hat\alpha)+(\alpha-\hat\alpha)^T\left(\tfrac12H_S(\hat\alpha)\right)(\alpha-\hat\alpha)}\right)^{n/2}\mathbf{1}\{\cdots\}=\left(\frac{1}{1+(\alpha-\hat\alpha)^T\left(\frac{1}{2S(\hat\alpha)}H_S(\hat\alpha)\right)(\alpha-\hat\alpha)}\right)^{n/2}\mathbf{1}\{\cdots\}$$
$$=\left(\frac{1}{1+\frac{1}{n-2}(\alpha-\hat\alpha)^T\left(\frac{n-2}{2S(\hat\alpha)}H_S(\hat\alpha)\right)(\alpha-\hat\alpha)}\right)^{\frac{(n-2)+2}{2}}\mathbf{1}\{\cdots\}.$$
Comparing with
$$\left(\frac{1}{1+\frac{1}{k}(x-m)^T\Sigma^{-1}(x-m)}\right)^{\frac{k+p}{2}}$$
for the $p$-variate $t$-density $t_{k,p}(m,\Sigma)$ with $k$ degrees of freedom, we see that, approximately,
$$\alpha\mid\text{data}\;\sim\;t_{n-2,2}\!\left(\hat\alpha,\;\frac{S(\hat\alpha)}{n-2}\left(\tfrac12H_S(\hat\alpha)\right)^{-1}\right).$$
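The grid-evaluation route mentioned above can be sketched directly: evaluate $S(\mu,\theta)^{-n/2}$ on a grid inside $|\theta|<1$, normalize, and read off posterior summaries. A numpy sketch (the toy series and grid ranges are made up for illustration):

```python
import numpy as np

def S(mu, theta, y):
    # truncated sum of squares S(mu, theta) for MA(1) (pre-sample y's = 0)
    p, total = 0.0, 0.0
    for yt in y:
        p = yt - theta * p
        total += (p - mu / (1.0 + theta)) ** 2
    return total

# Toy observed series; posterior on a grid over (mu, theta)
y = np.array([0.3, 1.5, 0.8, -0.2, 1.1, 0.9, 0.4, 1.6, 0.7, 0.1])
n = len(y)
mus = np.linspace(-2, 2, 81)
thetas = np.linspace(-0.95, 0.95, 77)   # stay strictly inside |theta| < 1

# log posterior = -(n/2) log S(mu, theta), up to an additive constant
logpost = np.array([[-0.5 * n * np.log(S(m, th, y)) for th in thetas]
                    for m in mus])
post = np.exp(logpost - logpost.max())  # subtract max for numerical stability
post /= post.sum()                      # normalize over the grid

# posterior means as grid-weighted averages of the marginals
mu_mean = np.sum(post.sum(axis=1) * mus)
theta_mean = np.sum(post.sum(axis=0) * thetas)
```

Credible intervals can be read off the same grid by accumulating the marginal weights, which gives a useful check on the $t$ approximation derived above.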