16 Nonlinear AR, RNN

1 RNN Motivation

First we recap some models we have discussed.

1.1 Regression with t as Covariate

Consider the simplest linear regression model:
$$y_t = \beta_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2). \tag{1.1}$$
Then consider a nonlinear model (the change-of-slope model):
$$y_t = \beta_0 + \beta_1 t + \beta_2 (t - c_1)_+ + \cdots + \beta_{k+1} (t - c_k)_+ + \varepsilon_t. \tag{1.2}$$
We can also write this using the notation $\sigma(u) = \mathrm{ReLU}(u) = u_+$. The unknown parameters here are $\beta_0, \ldots, \beta_{k+1}, c_1, \ldots, c_k$ and $\sigma$. This is a linear model for the feature vector $(1, t, (t-c_1)_+, \ldots, (t-c_k)_+)^T$.

Rewrite (1.2): let $x_t = t$, and write $\mu_t$ for the mean of $y_t$. Also denote
$$r_t = \big(\sigma(x_t - c_1), \ldots, \sigma(x_t - c_k)\big)^T, \qquad s_t = (x_t - c_1, \ldots, x_t - c_k)^T.$$
Now (1.2) becomes
$$x_t = t, \quad s_t = (x_t - c_1, \ldots, x_t - c_k)^T, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.3}$$
In words, $x_t$ is first converted to $s_t$ by a linear function; then we apply $\sigma(\cdot)$ elementwise to generate $r_t$. Then $\mu_t$ is a linear function of $r_t$, which serves as the mean of $y_t$.
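As a minimal sketch of (1.2)–(1.3), the change-of-slope model can be fit by ordinary least squares on ReLU features, assuming the knots $c_i$ are fixed in advance (in practice they may also be estimated). The series, knot values, and noise level below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(1, 101, dtype=float)
# Simulated series with a slope change at t = 50 (illustrative data).
y = 1.0 + 0.2 * t + 0.5 * np.maximum(t - 50, 0) + rng.normal(0, 1, t.size)

knots = np.array([25.0, 50.0, 75.0])           # c_1, ..., c_k (assumed known)
relu = lambda u: np.maximum(u, 0.0)            # sigma(u) = u_+

# Feature vector (1, t, (t-c_1)_+, ..., (t-c_k)_+)^T for each t.
X = np.column_stack([np.ones_like(t), t] + [relu(t - c) for c in knots])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS for beta_0, ..., beta_{k+1}
mu = X @ beta                                  # fitted means mu_t
```

Because the model is linear in the features, estimation reduces to a single least-squares solve once the knots are chosen.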

1.2 AR

Consider the AR model AR(1) with $x_t = y_{t-1}$. This is simply (1.1) with $t$ replaced by $x_t = y_{t-1}$:
$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
One can create a nonlinear version by using (1.3) with $x_t = y_{t-1}$. Call this NAR(1):
$$x_t = y_{t-1}, \quad s_t = (x_t - c_1, \ldots, x_t - c_k)^T, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.4}$$
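To make (1.4) concrete, here is a minimal sketch of simulating from NAR(1) with hypothetical parameter values (the knots, $\beta$, and noise scale below are illustrative, not fitted):

```python
import numpy as np

rng = np.random.default_rng(1)

c = np.array([-1.0, 0.0, 1.0])      # knots c_1, ..., c_k (hypothetical)
beta0 = 0.1
beta = np.array([0.8, -0.5, 0.3])   # beta (hypothetical)
sigma_eps = 0.2                      # noise standard deviation

n = 200
y = np.zeros(n)
for t in range(1, n):
    x_t = y[t - 1]                        # x_t = y_{t-1}
    r_t = np.maximum(x_t - c, 0.0)        # r_t = sigma(s_t), ReLU elementwise
    mu_t = beta0 + beta @ r_t             # mu_t = beta_0 + beta^T r_t
    y[t] = mu_t + sigma_eps * rng.normal()
```

The conditional mean is a piecewise-linear function of the single lag $y_{t-1}$, which is exactly what the ReLU features buy over AR(1).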
Now let's consider AR(p):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad \mu_t = \beta_0 + \beta^T x_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.5}$$
Note that (1.5) can be written in the familiar expanded form $y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + \beta_p y_{t-p} + \varepsilon_t$.
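Since (1.5) is a linear regression of $y_t$ on the lag vector $x_t$, it can be fit by least squares. A minimal sketch, using a simulated AR(2) series with illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(2) series y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + eps_t.
n, p = 500, 2
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(0, 1)

# Design matrix: row for time t holds (1, y_{t-1}, ..., y_{t-p}).
X = np.column_stack([np.ones(n - p)] + [y[p - j:n - j] for j in range(1, p + 1)])
target = y[p:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)  # (beta_0, beta_1, ..., beta_p)
```

With enough data the least-squares coefficients recover the generating values closely.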

Now we extend to a nonlinear version like (1.4):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = \big(x_{t1} - c_1^{(1)}, \ldots, x_{t1} - c_k^{(1)},\; x_{t2} - c_1^{(2)}, \ldots, x_{t2} - c_k^{(2)},\; \ldots,\; x_{tp} - c_1^{(p)}, \ldots, x_{tp} - c_k^{(p)}\big)^T,$$
$$r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t.$$
Here $x_{t1} = y_{t-1}, \ldots, x_{tp} = y_{t-p}$ denote the components of $x_t$. With this choice of $s_t$, note that
$$\mu_t = \beta_0 + \beta^T r_t = \beta_0 + \beta^T \sigma(s_t) = \beta_0 + \sum_{j=1}^p g_j(x_{tj}), \qquad \text{where } g_j(x) = \sum_{i=1}^k \beta_{i,j}\, \sigma\big(x - c_i^{(j)}\big).$$
In other words, we are fitting an additive model for $y_t$ in terms of the covariates $x_{t1} = y_{t-1}, \ldots, x_{tp} = y_{t-p}$. However, an additive model cannot capture interactions between covariates.

Instead of the above additive model, we shall use NAR(p):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = W x_t + b, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.7}$$
Here $W \in \mathbb{R}^{k \times p}$, $b \in \mathbb{R}^{k \times 1}$, and $\sigma$ is applied elementwise. The parameters are $W$, $b$, $\beta_0$, $\beta$ and $\sigma$.

In neural-network terminology, (1.7) is called a single-hidden-layer neural network.
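The mean function in (1.7) can be sketched as a one-hidden-layer forward pass. The parameter values below are random placeholders, not fitted estimates:

```python
import numpy as np

rng = np.random.default_rng(3)

p, k = 3, 8                       # p lags, k hidden units
W = rng.normal(size=(k, p))       # W in R^{k x p}
b = rng.normal(size=k)            # b in R^k
beta0 = 0.0
beta = rng.normal(size=k)
relu = lambda u: np.maximum(u, 0.0)

def nar_mean(x):
    """mu_t for a lag vector x = (y_{t-1}, ..., y_{t-p})^T, per (1.7)."""
    s = W @ x + b                 # s_t = W x_t + b (hidden pre-activation)
    r = relu(s)                   # r_t = sigma(s_t), elementwise
    return beta0 + beta @ r       # mu_t = beta_0 + beta^T r_t

mu = nar_mean(np.array([0.5, -0.2, 1.0]))
```

Because each hidden unit mixes all $p$ lags through a row of $W$, this model can capture interactions between lags that the additive model above cannot.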

2 RNNs

In (1.7), we make one modification:
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = W_r r_{t-1} + W x_t + b, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{2.1}$$
Here $r_t$ involves not just the current $x_t$ but also the previous hidden-layer output $r_{t-1}$, through $W_r r_{t-1}$ with $W_r \in \mathbb{R}^{k \times k}$. The parameters are now $W_r, W, b, \beta_0, \beta$ and $\sigma$. Typically $k > p$. (2.1) also requires an initialization of $r_t$, usually $r_0 = 0$.
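The recursion (2.1) can be sketched as a simple forward pass with $r_0 = 0$; the parameters below are random placeholders, and $x_t$ is taken to be scalar ($p = 1$) for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

k = 4                                   # hidden dimension
Wr = 0.3 * rng.normal(size=(k, k))      # W_r in R^{k x k} (placeholder)
W = rng.normal(size=(k, 1))             # W in R^{k x p}, with p = 1 here
b = rng.normal(size=k)
beta0 = 0.0
beta = rng.normal(size=k)

def rnn_means(x):
    """Compute mu_1, ..., mu_T for inputs x_1, ..., x_T via (2.1)."""
    r = np.zeros(k)                              # r_0 = 0
    mus = []
    for xt in x:
        s = Wr @ r + W @ np.atleast_1d(xt) + b   # s_t = W_r r_{t-1} + W x_t + b
        r = np.tanh(s)                           # r_t = sigma(s_t)
        mus.append(beta0 + beta @ r)             # mu_t = beta_0 + beta^T r_t
    return np.array(mus)

mu = rnn_means(np.array([0.1, -0.4, 0.7, 0.2]))
```

The single hidden state `r` is overwritten at each step, which is exactly how $r_t$ comes to carry information from all earlier inputs.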

In (1.7), $r_t$ depends only on $x_t$, while in (2.1), $r_t$ depends on $x_t, x_{t-1}, \ldots, x_1$. Assume WLOG that $x_t$ is defined for all $t = 1, 2, \ldots$. To see how the dependence unfolds, note that
$$r_1 = \sigma(W x_1 + b) \quad (r_0 = 0),$$
$$r_2 = \sigma\big(W_r\, \sigma(W x_1 + b) + W x_2 + b\big),$$
$$r_3 = \sigma\big(W_r\, \sigma(W_r\, \sigma(W x_1 + b) + W x_2 + b) + W x_3 + b\big),$$
$$r_4 = \sigma\big(W_r\, \sigma(W_r\, \sigma(W_r\, \sigma(W x_1 + b) + W x_2 + b) + W x_3 + b) + W x_4 + b\big).$$
From the above, $r_t$ clearly depends on $x_1, \ldots, x_t$, but the strength of the dependence varies.

RNNs can have stability issues, causing either gradient explosion or gradient vanishing. To see this, observe that
$$\frac{\partial r_t}{\partial x_u} = \sigma'(s_t)\, W_r\, \sigma'(s_{t-1})\, W_r \cdots \sigma'(s_{u+1})\, W_r\, \sigma'(s_u)\, W, \qquad u \le t,$$
where $\sigma'(s)$ denotes the diagonal matrix of elementwise derivatives. The long product of factors $\sigma'(\cdot) W_r$ can blow up or shrink to zero as $t - u$ grows.
To mitigate this, it is customary to take $\sigma$ to be the hyperbolic tangent function
$$\sigma_{\tanh}(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}.$$
Unlike ReLU, this function takes values between $-1$ and $1$, and its derivative satisfies
$$\sigma'(u) = 1 - \sigma^2(u) \in (0, 1].$$
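A quick numerical check of the identity $\sigma'(u) = 1 - \tanh^2(u) \in (0,1]$, plus an illustrative (scalar, not full-backprop) look at how repeated factors of the form $\sigma'(s)\, w_r$ shrink the gradient product geometrically:

```python
import numpy as np

u = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (np.tanh(u + h) - np.tanh(u - h)) / (2 * h)   # finite-difference derivative
exact = 1 - np.tanh(u) ** 2                             # closed-form tanh derivative

# Scalar caricature of the gradient product: t - u identical factors
# sigma'(s) * w_r with the factor below 1 in magnitude decay geometrically.
factor = (1 - np.tanh(0.5) ** 2) * 0.9                  # illustrative values
products = factor ** np.arange(1, 21)
```

Since each factor lies strictly inside the unit interval here, the product vanishes as the time gap grows, which is the scalar version of the vanishing-gradient problem.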

Further, we would like $r_t$ to be an ideal summary of $x_1, \ldots, x_t$. But in an RNN, $r_t$ depends mostly on the inputs close to $t$, so the RNN does not have a very long memory.