17 GRU, LSTM

1 GRU (Gated Recurrent Unit)

Recall the formula for the basic RNN: $r_t = \sigma(W^r r_{t-1} + W x_t + b)$. The basic problem is that $r_t$ depends on $r_{t-1}$ only through the term $W^r r_{t-1}$. If $W^r$ is a matrix with spectral radius less than 1, multiplication by $W^r$ can be thought of as "shrinking" $r_{t-1}$. Applied repeatedly, the dependence of $r_t$ on a much earlier state $r_u$ (with $u \ll t$) becomes vanishingly small. So $r_t$ should be connected to $r_{t-1}$ not only through $W^r r_{t-1}$.
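This shrinkage can be checked numerically. The sketch below uses an arbitrary $8 \times 8$ Gaussian matrix rescaled to spectral radius $0.9$ (the dimension and radius are placeholder choices, not from the notes), applies it repeatedly to a perturbation of the state, and watches the norm decay:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random recurrent weight matrix, rescaled so its spectral radius is 0.9 < 1.
# (An arbitrary example matrix, not taken from the notes.)
d = 8
W_r = rng.standard_normal((d, d))
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))

# Apply W_r repeatedly to a perturbation of the hidden state: its norm decays
# geometrically, so r_t's dependence on far-past states is exponentially small.
delta = rng.standard_normal(d)
norms = []
for t in range(50):
    delta = W_r @ delta
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])  # the norm after 50 steps is much smaller
```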

We first construct a candidate version $\tilde r_t$ of $r_t$:
\[\tilde r_t = \sigma(W^r r_{t-1} + W x_t + b). \tag{1.1}\]
Two natural options for the update are then $r_t = r_{t-1}$ (keep the old state unchanged) and $r_t = \tilde r_t$ (replace it by the candidate).

The idea behind GRU is to combine them: $r_t = z_t r_{t-1} + (1 - z_t) \tilde r_t$. This would be exactly a convex combination if $z_t$ were a scalar in $[0,1]$, but $z_t$ is allowed to be a vector, so we'd better write $r_t = z_t \odot r_{t-1} + (1 - z_t) \odot \tilde r_t$, where $\odot$ denotes the elementwise product.
For $z_t$, we take
\[z_t = \sigma_{\mathrm{sigmoid}}(W^{rz} r_{t-1} + W^z x_t + b^z), \tag{1.2}\]
where $\sigma_{\mathrm{sigmoid}}(u) = \frac{1}{1 + e^{-u}}$ is applied elementwise. We sometimes refer to $z_t$ as a gate: it controls how close $r_t$ is to $r_{t-1}$ versus $\tilde r_t$.
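To illustrate the gated update numerically (a minimal numpy sketch; the dimensions and random parameter values are hypothetical placeholders), each coordinate of $r_t$ lands between the corresponding coordinates of $r_{t-1}$ and $\tilde r_t$, since each coordinate of $z_t$ lies in $(0,1)$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical small dimensions and random parameters, for illustration only.
rng = np.random.default_rng(1)
d, p = 4, 3
W_rz, W_z, b_z = rng.standard_normal((d, d)), rng.standard_normal((d, p)), np.zeros(d)

r_prev = rng.standard_normal(d)   # r_{t-1}
r_tilde = rng.standard_normal(d)  # candidate state
x_t = rng.standard_normal(p)

# (1.2): the update gate, elementwise in (0, 1).
z_t = sigmoid(W_rz @ r_prev + W_z @ x_t + b_z)

# Elementwise convex combination of the old state and the candidate state.
r_t = z_t * r_prev + (1.0 - z_t) * r_tilde

# Each coordinate of r_t lies between the corresponding coordinates
# of r_prev and r_tilde.
assert np.all(np.minimum(r_prev, r_tilde) <= r_t)
assert np.all(r_t <= np.maximum(r_prev, r_tilde))
```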

$r_{t-1}$ appears both in $z_t \odot r_{t-1}$ and in $\tilde r_t$, which might be redundant. So GRU modifies (1.1) by using one more gate:
\[\tilde r_t = \sigma(W^r (r_{t-1} \odot g_t) + W x_t + b),\]
where $g_t$ controls the extent to which $r_{t-1}$ is used in the formula for $\tilde r_t$. Similarly to (1.2),
\[g_t = \sigma_{\mathrm{sigmoid}}(W^{rg} r_{t-1} + W^g x_t + b^g).\]
Putting all the formulae together:
\[
\begin{aligned}
r_0 &= 0,\\
g_t &= \sigma_{\mathrm{sigmoid}}(W^{rg} r_{t-1} + W^g x_t + b^g),\\
z_t &= \sigma_{\mathrm{sigmoid}}(W^{rz} r_{t-1} + W^z x_t + b^z),\\
\tilde r_t &= \sigma_{\tanh}(W^r (r_{t-1} \odot g_t) + W x_t + b),\\
r_t &= z_t \odot r_{t-1} + (1 - z_t) \odot \tilde r_t,\\
\mu_t &= \beta_0 + \beta^\top r_t.
\end{aligned}
\tag{1.3}
\]
$z_t$ is called the update gate, while $g_t$ is called the reset gate. The unknown parameters are $W^{rg}, W^g, b^g, W^{rz}, W^z, b^z, W^r, W, b, \beta_0, \beta$.
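The full recursion (1.3) can be sketched in a few lines of numpy. The dimensions and random parameter values below are hypothetical placeholders; only the update equations follow the notes:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(r_prev, x_t, params):
    """One step of the GRU recursion (1.3). `params` holds the matrices and
    vectors named in the notes; shapes here are hypothetical."""
    g_t = sigmoid(params["W_rg"] @ r_prev + params["W_g"] @ x_t + params["b_g"])  # reset gate
    z_t = sigmoid(params["W_rz"] @ r_prev + params["W_z"] @ x_t + params["b_z"])  # update gate
    r_tilde = np.tanh(params["W_r"] @ (r_prev * g_t) + params["W"] @ x_t + params["b"])
    return z_t * r_prev + (1.0 - z_t) * r_tilde

# Run the recursion from r_0 = 0 on a short random input sequence and read out
# mu_t = beta_0 + beta^T r_t. All parameter values are random placeholders.
rng = np.random.default_rng(2)
d, p = 5, 3
params = {
    "W_rg": rng.standard_normal((d, d)), "W_g": rng.standard_normal((d, p)), "b_g": np.zeros(d),
    "W_rz": rng.standard_normal((d, d)), "W_z": rng.standard_normal((d, p)), "b_z": np.zeros(d),
    "W_r":  rng.standard_normal((d, d)), "W":   rng.standard_normal((d, p)), "b":   np.zeros(d),
}
beta0, beta = 0.1, rng.standard_normal(d)

r_t = np.zeros(d)  # r_0 = 0
for t in range(10):
    r_t = gru_step(r_t, rng.standard_normal(p), params)
mu_t = beta0 + beta @ r_t
print(mu_t)
```

Note that since $r_0 = 0$ and each step takes an elementwise convex combination of the previous state and a $\tanh$ output, every coordinate of $r_t$ stays in $(-1, 1)$.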

2 LSTM (Long Short-Term Memory)

This is another modification of the basic RNN that enables long memory. It has one more gate than GRU. Instead of a recursion directly from $r_{t-1}$ to $r_t$, the LSTM recursion maps $(s_{t-1}, r_{t-1}) \mapsto (s_t, r_t)$, where $s_t$ is an additional state (the cell state).

We again construct a candidate version:
\[\tilde r_t = \sigma(W^r r_{t-1} + W x_t + b). \tag{2.1}\]
In LSTM, $s_t$ is taken to be a linear combination of $s_{t-1}$ and $\tilde r_t$, with gates controlling both coefficients of the combination: $s_t = f_t \odot s_{t-1} + i_t \odot \tilde r_t$, where $f_t, i_t$ denote gates. Usually $r_t$ would then be defined as $\sigma_{\tanh}(s_t)$; LSTM adds one more gate to $r_t$: $r_t = o_t \odot \sigma_{\tanh}(s_t)$.
Putting everything together, we obtain the full LSTM model:
\[
\begin{aligned}
r_0 &= 0, \quad s_0 = 0,\\
f_t &= \sigma_{\mathrm{sigmoid}}(W^{rf} r_{t-1} + W^f x_t + b^f),\\
i_t &= \sigma_{\mathrm{sigmoid}}(W^{ri} r_{t-1} + W^i x_t + b^i),\\
o_t &= \sigma_{\mathrm{sigmoid}}(W^{ro} r_{t-1} + W^o x_t + b^o),\\
\tilde r_t &= \sigma_{\tanh}(W^r r_{t-1} + W x_t + b),\\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde r_t,\\
r_t &= o_t \odot \sigma_{\tanh}(s_t),\\
\mu_t &= \beta_0 + \beta^\top r_t.
\end{aligned}
\tag{2.2}
\]
$f_t$ is called the forget gate, $i_t$ the input gate, and $o_t$ the output gate. The unknown parameters are $W^{rf}, W^f, b^f, W^{ri}, W^i, b^i, W^{ro}, W^o, b^o, W^r, W, b, \beta_0, \beta$.
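As with GRU, the recursion (2.2) translates directly into numpy. Again, the dimensions and random parameter values are hypothetical placeholders; only the update equations follow the notes:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(s_prev, r_prev, x_t, params):
    """One step of the LSTM recursion (2.2): (s_{t-1}, r_{t-1}) -> (s_t, r_t).
    Parameter names follow the notes; shapes here are hypothetical."""
    f_t = sigmoid(params["W_rf"] @ r_prev + params["W_f"] @ x_t + params["b_f"])  # forget gate
    i_t = sigmoid(params["W_ri"] @ r_prev + params["W_i"] @ x_t + params["b_i"])  # input gate
    o_t = sigmoid(params["W_ro"] @ r_prev + params["W_o"] @ x_t + params["b_o"])  # output gate
    r_tilde = np.tanh(params["W_r"] @ r_prev + params["W"] @ x_t + params["b"])
    s_t = f_t * s_prev + i_t * r_tilde
    r_t = o_t * np.tanh(s_t)
    return s_t, r_t

# Random placeholder parameters with state dimension d and input dimension p.
rng = np.random.default_rng(3)
d, p = 5, 3
params = {
    "W_rf": rng.standard_normal((d, d)), "W_f": rng.standard_normal((d, p)), "b_f": np.zeros(d),
    "W_ri": rng.standard_normal((d, d)), "W_i": rng.standard_normal((d, p)), "b_i": np.zeros(d),
    "W_ro": rng.standard_normal((d, d)), "W_o": rng.standard_normal((d, p)), "b_o": np.zeros(d),
    "W_r":  rng.standard_normal((d, d)), "W":   rng.standard_normal((d, p)), "b":   np.zeros(d),
}
beta0, beta = 0.1, rng.standard_normal(d)

s_t, r_t = np.zeros(d), np.zeros(d)  # s_0 = r_0 = 0
for t in range(10):
    s_t, r_t = lstm_step(s_t, r_t, rng.standard_normal(p), params)
mu_t = beta0 + beta @ r_t
print(mu_t)
```

Because $r_t = o_t \odot \sigma_{\tanh}(s_t)$ with $o_t$ elementwise in $(0,1)$, every coordinate of $r_t$ stays in $(-1,1)$ even though the cell state $s_t$ itself is unbounded.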