17 GRU, LSTM

1 GRU (Gated Recurrent Unit)

Recall the formula for the basic RNN: $r_t = \sigma(W^r r_{t-1} + W x_t + b)$. The basic problem is that $r_t$ depends on $r_{t-1}$ only through the term $W^r r_{t-1}$. If $W^r$ is a matrix with spectral radius less than 1, multiplication by $W^r$ can be thought of as "shrinking" $r_{t-1}$. Applied repeatedly, the dependence of $r_t$ on a much earlier state $r_u$ (with $u \ll t$) becomes vanishingly small. So $r_t$ should be connected to $r_{t-1}$ not only through $W^r r_{t-1}$.
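This shrinkage can be checked numerically. The sketch below uses an arbitrary $8 \times 8$ Gaussian matrix rescaled to spectral radius $0.9$ (the dimension and radius are placeholder choices, not from the notes), applies it repeatedly to a perturbation of the state, and watches the norm decay:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random recurrent weight matrix, rescaled so its spectral radius is 0.9 < 1.
# (An arbitrary example matrix, not taken from the notes.)
d = 8
W_r = rng.standard_normal((d, d))
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))

# Apply W_r repeatedly to a perturbation of the hidden state: its norm decays
# geometrically, so r_t's dependence on far-past states is exponentially small.
delta = rng.standard_normal(d)
norms = []
for t in range(50):
    delta = W_r @ delta
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])  # the norm after 50 steps is much smaller
```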

We first construct a candidate version $\tilde r_t$ of $r_t$:
\[\tilde r_t = \sigma(W^r r_{t-1} + W x_t + b). \tag{1.1}\]
Two natural options for the update are then $r_t = r_{t-1}$ (keep the old state unchanged) and $r_t = \tilde r_t$ (replace it by the candidate).

The idea behind GRU is to combine them: $r_t = z_t r_{t-1} + (1 - z_t) \tilde r_t$. This would be exactly a convex combination if $z_t$ were a scalar in $[0,1]$, but $z_t$ is allowed to be a vector, so we'd better write $r_t = z_t \odot r_{t-1} + (1 - z_t) \odot \tilde r_t$, where $\odot$ denotes the elementwise product.
For $z_t$, we take
\[z_t = \sigma_{\mathrm{sigmoid}}(W^{rz} r_{t-1} + W^z x_t + b^z), \tag{1.2}\]
where $\sigma_{\mathrm{sigmoid}}(u) = \frac{1}{1 + e^{-u}}$ is applied elementwise. We sometimes refer to $z_t$ as a gate: it controls how close $r_t$ is to $r_{t-1}$ versus $\tilde r_t$.
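To illustrate the gated update numerically (a minimal numpy sketch; the dimensions and random parameter values are hypothetical placeholders), each coordinate of $r_t$ lands between the corresponding coordinates of $r_{t-1}$ and $\tilde r_t$, since each coordinate of $z_t$ lies in $(0,1)$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical small dimensions and random parameters, for illustration only.
rng = np.random.default_rng(1)
d, p = 4, 3
W_rz, W_z, b_z = rng.standard_normal((d, d)), rng.standard_normal((d, p)), np.zeros(d)

r_prev = rng.standard_normal(d)   # r_{t-1}
r_tilde = rng.standard_normal(d)  # candidate state
x_t = rng.standard_normal(p)

# (1.2): the update gate, elementwise in (0, 1).
z_t = sigmoid(W_rz @ r_prev + W_z @ x_t + b_z)

# Elementwise convex combination of the old state and the candidate state.
r_t = z_t * r_prev + (1.0 - z_t) * r_tilde

# Each coordinate of r_t lies between the corresponding coordinates
# of r_prev and r_tilde.
assert np.all(np.minimum(r_prev, r_tilde) <= r_t)
assert np.all(r_t <= np.maximum(r_prev, r_tilde))
```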

$r_{t-1}$ appears both in $z_t \odot r_{t-1}$ and in $\tilde r_t$, which might be redundant. So GRU modifies (1.1) by using one more gate:
\[\tilde r_t = \sigma(W^r (r_{t-1} \odot g_t) + W x_t + b),\]
where $g_t$ controls the extent to which $r_{t-1}$ is used in the formula for $\tilde r_t$. Similarly to (1.2),
\[g_t = \sigma_{\mathrm{sigmoid}}(W^{rg} r_{t-1} + W^g x_t + b^g).\]
Putting all the formulae together:
\[
\begin{aligned}
r_0 &= 0,\\
g_t &= \sigma_{\mathrm{sigmoid}}(W^{rg} r_{t-1} + W^g x_t + b^g),\\
z_t &= \sigma_{\mathrm{sigmoid}}(W^{rz} r_{t-1} + W^z x_t + b^z),\\
\tilde r_t &= \sigma_{\tanh}(W^r (r_{t-1} \odot g_t) + W x_t + b),\\
r_t &= z_t \odot r_{t-1} + (1 - z_t) \odot \tilde r_t,\\
\mu_t &= \beta_0 + \beta^\top r_t.
\end{aligned}
\tag{1.3}
\]
$z_t$ is called the update gate, while $g_t$ is called the reset gate. The unknown parameters are $W^{rg}, W^g, b^g, W^{rz}, W^z, b^z, W^r, W, b, \beta_0, \beta$.
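The full recursion (1.3) can be sketched in a few lines of numpy. The dimensions and random parameter values below are hypothetical placeholders; only the update equations follow the notes:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(r_prev, x_t, params):
    """One step of the GRU recursion (1.3). `params` holds the matrices and
    vectors named in the notes; shapes here are hypothetical."""
    g_t = sigmoid(params["W_rg"] @ r_prev + params["W_g"] @ x_t + params["b_g"])  # reset gate
    z_t = sigmoid(params["W_rz"] @ r_prev + params["W_z"] @ x_t + params["b_z"])  # update gate
    r_tilde = np.tanh(params["W_r"] @ (r_prev * g_t) + params["W"] @ x_t + params["b"])
    return z_t * r_prev + (1.0 - z_t) * r_tilde

# Run the recursion from r_0 = 0 on a short random input sequence and read out
# mu_t = beta_0 + beta^T r_t. All parameter values are random placeholders.
rng = np.random.default_rng(2)
d, p = 5, 3
params = {
    "W_rg": rng.standard_normal((d, d)), "W_g": rng.standard_normal((d, p)), "b_g": np.zeros(d),
    "W_rz": rng.standard_normal((d, d)), "W_z": rng.standard_normal((d, p)), "b_z": np.zeros(d),
    "W_r":  rng.standard_normal((d, d)), "W":   rng.standard_normal((d, p)), "b":   np.zeros(d),
}
beta0, beta = 0.1, rng.standard_normal(d)

r_t = np.zeros(d)  # r_0 = 0
for t in range(10):
    r_t = gru_step(r_t, rng.standard_normal(p), params)
mu_t = beta0 + beta @ r_t
print(mu_t)
```

Note that since $r_0 = 0$ and each step takes an elementwise convex combination of the previous state and a $\tanh$ output, every coordinate of $r_t$ stays in $(-1, 1)$.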

2 LSTM (Long Short-Term Memory)

This is another modification of the basic RNN that enables long memory. It has one more gate than GRU. Instead of a recursion directly from $r_{t-1}$ to $r_t$, the LSTM recursion maps $(s_{t-1}, r_{t-1}) \mapsto (s_t, r_t)$, where $s_t$ is an additional state (the cell state).

We again construct a candidate version:
\[\tilde r_t = \sigma(W^r r_{t-1} + W x_t + b). \tag{2.1}\]
In LSTM, $s_t$ is taken to be a linear combination of $s_{t-1}$ and $\tilde r_t$, with gates controlling both coefficients of the combination: $s_t = f_t \odot s_{t-1} + i_t \odot \tilde r_t$, where $f_t, i_t$ denote gates. Usually $r_t$ would then be defined as $\sigma_{\tanh}(s_t)$; LSTM adds one more gate to $r_t$: $r_t = o_t \odot \sigma_{\tanh}(s_t)$.
Putting everything together, we obtain the full LSTM model:
\[
\begin{aligned}
r_0 &= 0, \quad s_0 = 0,\\
f_t &= \sigma_{\mathrm{sigmoid}}(W^{rf} r_{t-1} + W^f x_t + b^f),\\
i_t &= \sigma_{\mathrm{sigmoid}}(W^{ri} r_{t-1} + W^i x_t + b^i),\\
o_t &= \sigma_{\mathrm{sigmoid}}(W^{ro} r_{t-1} + W^o x_t + b^o),\\
\tilde r_t &= \sigma_{\tanh}(W^r r_{t-1} + W x_t + b),\\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde r_t,\\
r_t &= o_t \odot \sigma_{\tanh}(s_t),\\
\mu_t &= \beta_0 + \beta^\top r_t.
\end{aligned}
\tag{2.2}
\]
$f_t$ is called the forget gate, $i_t$ the input gate, and $o_t$ the output gate. The unknown parameters are $W^{rf}, W^f, b^f, W^{ri}, W^i, b^i, W^{ro}, W^o, b^o, W^r, W, b, \beta_0, \beta$.
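As with GRU, the recursion (2.2) translates directly into numpy. Again, the dimensions and random parameter values are hypothetical placeholders; only the update equations follow the notes:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(s_prev, r_prev, x_t, params):
    """One step of the LSTM recursion (2.2): (s_{t-1}, r_{t-1}) -> (s_t, r_t).
    Parameter names follow the notes; shapes here are hypothetical."""
    f_t = sigmoid(params["W_rf"] @ r_prev + params["W_f"] @ x_t + params["b_f"])  # forget gate
    i_t = sigmoid(params["W_ri"] @ r_prev + params["W_i"] @ x_t + params["b_i"])  # input gate
    o_t = sigmoid(params["W_ro"] @ r_prev + params["W_o"] @ x_t + params["b_o"])  # output gate
    r_tilde = np.tanh(params["W_r"] @ r_prev + params["W"] @ x_t + params["b"])
    s_t = f_t * s_prev + i_t * r_tilde
    r_t = o_t * np.tanh(s_t)
    return s_t, r_t

# Random placeholder parameters with state dimension d and input dimension p.
rng = np.random.default_rng(3)
d, p = 5, 3
params = {
    "W_rf": rng.standard_normal((d, d)), "W_f": rng.standard_normal((d, p)), "b_f": np.zeros(d),
    "W_ri": rng.standard_normal((d, d)), "W_i": rng.standard_normal((d, p)), "b_i": np.zeros(d),
    "W_ro": rng.standard_normal((d, d)), "W_o": rng.standard_normal((d, p)), "b_o": np.zeros(d),
    "W_r":  rng.standard_normal((d, d)), "W":   rng.standard_normal((d, p)), "b":   np.zeros(d),
}
beta0, beta = 0.1, rng.standard_normal(d)

s_t, r_t = np.zeros(d), np.zeros(d)  # s_0 = r_0 = 0
for t in range(10):
    s_t, r_t = lstm_step(s_t, r_t, rng.standard_normal(p), params)
mu_t = beta0 + beta @ r_t
print(mu_t)
```

Because $r_t = o_t \odot \sigma_{\tanh}(s_t)$ with $o_t$ elementwise in $(0,1)$, every coordinate of $r_t$ stays in $(-1,1)$ even though the cell state $s_t$ itself is unbounded.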