Recall the formula for the RNN: $h_t = \tanh(W h_{t-1} + U x_t + b)$. The basic problem is that $h_t$ depends on $h_{t-k}$ through $W^k$. If $W$ is a matrix with spectral radius $\rho$ less than $1$, each application of $W$ can be thought of as "reducing" the signal by a factor of $\rho$. Applied repeatedly, $h_t$'s dependence on $h_{t-k}$ will be very small. So $h_t$ should connect to $h_{t-1}$ not only through $W$.
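The shrinking effect of repeated applications of $W$ can be seen numerically. Below is a minimal sketch with a hypothetical $2\times 2$ matrix whose spectral radius is $0.9$; the norm of $W^k h$ decays roughly like $0.9^k$:

```python
import numpy as np

# A hypothetical recurrence matrix with spectral radius 0.9 (< 1).
W = np.array([[0.9, 0.0],
              [0.0, 0.5]])
h = np.array([1.0, 1.0])  # some initial hidden state

# After k applications of W, the contribution of h to the current state
# shrinks roughly like rho^k, with rho = 0.9 the spectral radius.
for k in [1, 10, 50]:
    hk = np.linalg.matrix_power(W, k) @ h
    print(k, np.linalg.norm(hk))
```

By $k = 50$ the norm has dropped below $0.01$, which is the vanishing dependence the text describes.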
We first construct a potential version of $h_t$:
$$\tilde h_t = \tanh(W h_{t-1} + U x_t + b). \tag{1.1}$$
Two natural options:
$h_t = \tilde h_t$. We are back to the RNN.
$h_t = h_{t-1}$: $h_t$ is exactly equal to $h_{t-1}$, which means $x_t$ is ignored.
The idea behind GRU is to combine them: $h_t = z_t h_{t-1} + (1 - z_t)\tilde h_t$. This would exactly be a convex combination if $z_t$ were a scalar in $[0, 1]$, but $z_t$ is allowed to be a vector, so we'd better write
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde h_t,$$
where $\odot$ denotes the elementwise product.
For $z_t$, we take
$$z_t = \sigma(W_z h_{t-1} + U_z x_t + b_z), \tag{1.2}$$
where $\sigma(u) = 1/(1 + e^{-u})$ is the sigmoid function applied elementwise. We sometimes refer to $z_t$ as a gate. It controls the closeness of $h_t$ to $h_{t-1}$ and $\tilde h_t$.
$h_{t-1}$ appears in both $\tilde h_t$ and $z_t$. It might be redundant. So GRU modifies (1.1) by using one more gate:
$$\tilde h_t = \tanh(W (r_t \odot h_{t-1}) + U x_t + b),$$
where $r_t$ controls the extent to which $h_{t-1}$ is used in the formula for $\tilde h_t$. Similar to (1.2),
$$r_t = \sigma(W_r h_{t-1} + U_r x_t + b_r).$$
Putting all the formulae together:
$$\begin{aligned}
z_t &= \sigma(W_z h_{t-1} + U_z x_t + b_z),\\
r_t &= \sigma(W_r h_{t-1} + U_r x_t + b_r),\\
\tilde h_t &= \tanh(W (r_t \odot h_{t-1}) + U x_t + b),\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde h_t.
\end{aligned}$$
$z_t$ is called the update gate while $r_t$ is called the reset gate. Unknown parameters are $W, U, b, W_z, U_z, b_z, W_r, U_r, b_r$.
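The four formulae above translate directly into one update step. Below is a minimal numpy sketch, assuming the convention that the update gate $z_t$ multiplies $h_{t-1}$ (as written above; some references swap the roles of $z_t$ and $1 - z_t$). The parameter names mirror the unknown parameters listed above:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(h_prev, x, params):
    """One GRU update following the formulas above.

    params holds W, U, b (candidate), W_z, U_z, b_z (update gate),
    and W_r, U_r, b_r (reset gate)."""
    z = sigmoid(params["W_z"] @ h_prev + params["U_z"] @ x + params["b_z"])  # update gate
    r = sigmoid(params["W_r"] @ h_prev + params["U_r"] @ x + params["b_r"])  # reset gate
    # Candidate state: r controls how much of h_prev enters the tanh.
    h_tilde = np.tanh(params["W"] @ (r * h_prev) + params["U"] @ x + params["b"])
    # Elementwise "convex combination" of the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_tilde

# Tiny usage example with random parameters (hidden size 3, input size 2).
rng = np.random.default_rng(0)
d, p = 3, 2
params = {k: rng.standard_normal((d, d)) for k in ["W", "W_z", "W_r"]}
params.update({k: rng.standard_normal((d, p)) for k in ["U", "U_z", "U_r"]})
params.update({k: rng.standard_normal(d) for k in ["b", "b_z", "b_r"]})
h = gru_step(np.zeros(d), rng.standard_normal(p), params)
print(h.shape)
```

Since the candidate passes through $\tanh$ and the combination is elementwise convex, starting from $h_0 = 0$ every coordinate of $h_t$ stays in $(-1, 1)$.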
LSTM (Long Short-Term Memory)
This is another modification to the basic RNN for enabling long memory. It has one more gate compared with GRU. Instead of a recursion directly between $h_t$ and $h_{t-1}$, LSTM recursions are between the pairs $(c_t, h_t)$ and $(c_{t-1}, h_{t-1})$, where $c_t$ is called the cell state.
We again construct a potential version:
$$\tilde c_t = \tanh(W_c h_{t-1} + U_c x_t + b_c).$$
In LSTM, $c_t$ is taken to be a linear combination of $c_{t-1}$ and $\tilde c_t$, with gates controlling both coefficients of the linear combination:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t,$$
where $f_t, i_t$ denote gates. A gate is usually defined as $\sigma(W h_{t-1} + U x_t + b)$ with its own parameters. In LSTM, we add a gate to $h_t$ as well:
$$h_t = o_t \odot \tanh(c_t).$$
Putting everything together, we obtain the full LSTM model:
$$\begin{aligned}
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f),\\
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i),\\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o),\\
\tilde c_t &= \tanh(W_c h_{t-1} + U_c x_t + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t,\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}$$
$f_t$ is called the forget gate, $i_t$ is called the input gate, and $o_t$ is called the output gate. Unknown parameters are $W_f, U_f, b_f, W_i, U_i, b_i, W_o, U_o, b_o, W_c, U_c, b_c$.
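As with GRU, the full model is a few lines of code. Below is a minimal numpy sketch of one LSTM step under the formulas above; the parameter naming (`W_f`, `U_f`, `b_f`, and so on for each gate) is an assumption chosen to mirror the unknown parameters listed above:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(c_prev, h_prev, x, params):
    """One LSTM update. The recursion is between the pairs
    (c_prev, h_prev) and the returned (c, h)."""
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x + params["b_i"])  # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x + params["b_o"])  # output gate
    c_tilde = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x + params["b_c"])  # candidate cell
    c = f * c_prev + i * c_tilde  # gated linear combination of old and candidate cell
    h = o * np.tanh(c)            # gated output
    return c, h

# Tiny usage example with random parameters (hidden size 3, input size 2).
rng = np.random.default_rng(1)
d, p = 3, 2
params = {f"W_{g}": rng.standard_normal((d, d)) for g in "fioc"}
params.update({f"U_{g}": rng.standard_normal((d, p)) for g in "fioc"})
params.update({f"b_{g}": rng.standard_normal(d) for g in "fioc"})
c, h = lstm_step(np.zeros(d), np.zeros(d), rng.standard_normal(p), params)
print(c.shape, h.shape)
```

Note that $h_t = o_t \odot \tanh(c_t)$ keeps every coordinate of $h_t$ in $(-1, 1)$, while $c_t$ itself is not squashed and can accumulate over time, which is what enables the long memory.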