16 Nonlinear AR, RNN

1 RNN Motivation

First we recap some models we have discussed.

1.1 Regression with t as Covariate

Consider the simplest linear regression model:
$$y_t = \beta_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2). \tag{1.1}$$
Then consider a nonlinear model (the change-of-slope model):
$$y_t = \beta_0 + \beta_1 t + \beta_2 (t - c_1)_+ + \cdots + \beta_{k+1} (t - c_k)_+ + \varepsilon_t. \tag{1.2}$$
We can also write this using the notation $\sigma(u) = \mathrm{ReLU}(u) = u_+$. The unknown parameters here are $\beta_0, \ldots, \beta_{k+1}, c_1, \ldots, c_k$ and $\sigma$. This is a linear model for the feature vector $(1, t, (t-c_1)_+, \ldots, (t-c_k)_+)^T$.

Rewrite (1.2): let $x_t = t$, and write $\mu_t$ for the mean of $y_t$. Also denote
$$r_t = \big(\sigma(x_t - c_1), \ldots, \sigma(x_t - c_k)\big)^T, \qquad s_t = (x_t - c_1, \ldots, x_t - c_k)^T.$$
Now (1.2) becomes
$$x_t = t, \quad s_t = (x_t - c_1, \ldots, x_t - c_k)^T, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.3}$$
In words, $x_t$ is first converted to $s_t$ by a linear function; then we apply $\sigma(\cdot)$ elementwise to generate $r_t$. Then $\mu_t$ is a linear function of $r_t$, which serves as the mean of $y_t$.
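As a minimal sketch of (1.2)–(1.3), the change-of-slope model can be fit by ordinary least squares on ReLU features, assuming the knots $c_i$ are fixed in advance (in practice they may also be estimated). The series, knot values, and noise level below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(1, 101, dtype=float)
# Simulated series with a slope change at t = 50 (illustrative data).
y = 1.0 + 0.2 * t + 0.5 * np.maximum(t - 50, 0) + rng.normal(0, 1, t.size)

knots = np.array([25.0, 50.0, 75.0])           # c_1, ..., c_k (assumed known)
relu = lambda u: np.maximum(u, 0.0)            # sigma(u) = u_+

# Feature vector (1, t, (t-c_1)_+, ..., (t-c_k)_+)^T for each t.
X = np.column_stack([np.ones_like(t), t] + [relu(t - c) for c in knots])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS for beta_0, ..., beta_{k+1}
mu = X @ beta                                  # fitted means mu_t
```

Because the model is linear in the features, estimation reduces to a single least-squares solve once the knots are chosen.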

1.2 AR

Consider the AR model AR(1) with $x_t = y_{t-1}$. This is simply (1.1) with $t$ replaced by $x_t = y_{t-1}$:
$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
One can create a nonlinear version by using (1.3) with $x_t = y_{t-1}$. Call this NAR(1):
$$x_t = y_{t-1}, \quad s_t = (x_t - c_1, \ldots, x_t - c_k)^T, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.4}$$
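To make (1.4) concrete, here is a minimal sketch of simulating from NAR(1) with hypothetical parameter values (the knots, $\beta$, and noise scale below are illustrative, not fitted):

```python
import numpy as np

rng = np.random.default_rng(1)

c = np.array([-1.0, 0.0, 1.0])      # knots c_1, ..., c_k (hypothetical)
beta0 = 0.1
beta = np.array([0.8, -0.5, 0.3])   # beta (hypothetical)
sigma_eps = 0.2                      # noise standard deviation

n = 200
y = np.zeros(n)
for t in range(1, n):
    x_t = y[t - 1]                        # x_t = y_{t-1}
    r_t = np.maximum(x_t - c, 0.0)        # r_t = sigma(s_t), ReLU elementwise
    mu_t = beta0 + beta @ r_t             # mu_t = beta_0 + beta^T r_t
    y[t] = mu_t + sigma_eps * rng.normal()
```

The conditional mean is a piecewise-linear function of the single lag $y_{t-1}$, which is exactly what the ReLU features buy over AR(1).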
Now let's consider AR(p):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad \mu_t = \beta_0 + \beta^T x_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.5}$$
Note that (1.5) can be written in the familiar expanded form $y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + \beta_p y_{t-p} + \varepsilon_t$.
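Since (1.5) is a linear regression of $y_t$ on the lag vector $x_t$, it can be fit by least squares. A minimal sketch, using a simulated AR(2) series with illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(2) series y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + eps_t.
n, p = 500, 2
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(0, 1)

# Design matrix: row for time t holds (1, y_{t-1}, ..., y_{t-p}).
X = np.column_stack([np.ones(n - p)] + [y[p - j:n - j] for j in range(1, p + 1)])
target = y[p:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)  # (beta_0, beta_1, ..., beta_p)
```

With enough data the least-squares coefficients recover the generating values closely.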

Now we extend to a nonlinear version like (1.4):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = \big(x_{t1} - c_1^{(1)}, \ldots, x_{t1} - c_k^{(1)},\; x_{t2} - c_1^{(2)}, \ldots, x_{t2} - c_k^{(2)},\; \ldots,\; x_{tp} - c_1^{(p)}, \ldots, x_{tp} - c_k^{(p)}\big)^T,$$
$$r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t.$$
Here $x_{t1} = y_{t-1}, \ldots, x_{tp} = y_{t-p}$ denote the components of $x_t$. With this choice of $s_t$, note that
$$\mu_t = \beta_0 + \beta^T r_t = \beta_0 + \beta^T \sigma(s_t) = \beta_0 + \sum_{j=1}^p g_j(x_{tj}), \qquad \text{where } g_j(x) = \sum_{i=1}^k \beta_{i,j}\, \sigma\big(x - c_i^{(j)}\big).$$
In other words, we are fitting an additive model for $y_t$ in terms of the covariates $x_{t1} = y_{t-1}, \ldots, x_{tp} = y_{t-p}$. However, an additive model cannot capture interactions between covariates.

Instead of the above additive model, we shall use NAR(p):
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = W x_t + b, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{1.7}$$
Here $W \in \mathbb{R}^{k \times p}$, $b \in \mathbb{R}^{k \times 1}$, and $\sigma$ is applied elementwise. The parameters are $W$, $b$, $\beta_0$, $\beta$ and $\sigma$.

In neural-network terminology, (1.7) is called a single-hidden-layer neural network.
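The mean function in (1.7) can be sketched as a one-hidden-layer forward pass. The parameter values below are random placeholders, not fitted estimates:

```python
import numpy as np

rng = np.random.default_rng(3)

p, k = 3, 8                       # p lags, k hidden units
W = rng.normal(size=(k, p))       # W in R^{k x p}
b = rng.normal(size=k)            # b in R^k
beta0 = 0.0
beta = rng.normal(size=k)
relu = lambda u: np.maximum(u, 0.0)

def nar_mean(x):
    """mu_t for a lag vector x = (y_{t-1}, ..., y_{t-p})^T, per (1.7)."""
    s = W @ x + b                 # s_t = W x_t + b (hidden pre-activation)
    r = relu(s)                   # r_t = sigma(s_t), elementwise
    return beta0 + beta @ r       # mu_t = beta_0 + beta^T r_t

mu = nar_mean(np.array([0.5, -0.2, 1.0]))
```

Because each hidden unit mixes all $p$ lags through a row of $W$, this model can capture interactions between lags that the additive model above cannot.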

2 RNNs

In (1.7), we make one modification:
$$x_t = (y_{t-1}, \ldots, y_{t-p})^T, \quad s_t = W_r r_{t-1} + W x_t + b, \quad r_t = \sigma(s_t), \quad \mu_t = \beta_0 + \beta^T r_t, \quad y_t = \mu_t + \varepsilon_t. \tag{2.1}$$
Here $r_t$ involves not just the current $x_t$ but also the previous hidden-layer output $r_{t-1}$, through $W_r r_{t-1}$ with $W_r \in \mathbb{R}^{k \times k}$. The parameters are now $W_r, W, b, \beta_0, \beta$ and $\sigma$. Typically $k > p$. (2.1) also requires an initialization of $r_t$, usually $r_0 = 0$.
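The recursion (2.1) can be sketched as a simple forward pass with $r_0 = 0$; the parameters below are random placeholders, and $x_t$ is taken to be scalar ($p = 1$) for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

k = 4                                   # hidden dimension
Wr = 0.3 * rng.normal(size=(k, k))      # W_r in R^{k x k} (placeholder)
W = rng.normal(size=(k, 1))             # W in R^{k x p}, with p = 1 here
b = rng.normal(size=k)
beta0 = 0.0
beta = rng.normal(size=k)

def rnn_means(x):
    """Compute mu_1, ..., mu_T for inputs x_1, ..., x_T via (2.1)."""
    r = np.zeros(k)                              # r_0 = 0
    mus = []
    for xt in x:
        s = Wr @ r + W @ np.atleast_1d(xt) + b   # s_t = W_r r_{t-1} + W x_t + b
        r = np.tanh(s)                           # r_t = sigma(s_t)
        mus.append(beta0 + beta @ r)             # mu_t = beta_0 + beta^T r_t
    return np.array(mus)

mu = rnn_means(np.array([0.1, -0.4, 0.7, 0.2]))
```

The single hidden state `r` is overwritten at each step, which is exactly how $r_t$ comes to carry information from all earlier inputs.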

In (1.7), $r_t$ depends only on $x_t$, while in (2.1), $r_t$ depends on $x_t, x_{t-1}, \ldots, x_1$. Assume WLOG that $x_t$ is defined for all $t = 1, 2, \ldots$. To see how the dependence unfolds, note that
$$r_1 = \sigma(W x_1 + b) \quad (r_0 = 0),$$
$$r_2 = \sigma\big(W_r\, \sigma(W x_1 + b) + W x_2 + b\big),$$
$$r_3 = \sigma\big(W_r\, \sigma(W_r\, \sigma(W x_1 + b) + W x_2 + b) + W x_3 + b\big),$$
$$r_4 = \sigma\big(W_r\, \sigma(W_r\, \sigma(W_r\, \sigma(W x_1 + b) + W x_2 + b) + W x_3 + b) + W x_4 + b\big).$$
From the above, $r_t$ clearly depends on $x_1, \ldots, x_t$, but the strength of the dependence varies.

RNNs can have stability issues, causing either gradient explosion or gradient vanishing. To see this, observe that
$$\frac{\partial r_t}{\partial x_u} = \sigma'(s_t)\, W_r\, \sigma'(s_{t-1})\, W_r \cdots \sigma'(s_{u+1})\, W_r\, \sigma'(s_u)\, W, \qquad u \le t,$$
where $\sigma'(s)$ denotes the diagonal matrix of elementwise derivatives. The long product of factors $\sigma'(\cdot) W_r$ can blow up or shrink to zero as $t - u$ grows.
To mitigate this, it is customary to take $\sigma$ to be the hyperbolic tangent function
$$\sigma_{\tanh}(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}.$$
Unlike ReLU, this function takes values between $-1$ and $1$, and its derivative satisfies
$$\sigma'(u) = 1 - \sigma^2(u) \in (0, 1].$$
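A quick numerical check of the identity $\sigma'(u) = 1 - \tanh^2(u) \in (0,1]$, plus an illustrative (scalar, not full-backprop) look at how repeated factors of the form $\sigma'(s)\, w_r$ shrink the gradient product geometrically:

```python
import numpy as np

u = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (np.tanh(u + h) - np.tanh(u - h)) / (2 * h)   # finite-difference derivative
exact = 1 - np.tanh(u) ** 2                             # closed-form tanh derivative

# Scalar caricature of the gradient product: t - u identical factors
# sigma'(s) * w_r with the factor below 1 in magnitude decay geometrically.
factor = (1 - np.tanh(0.5) ** 2) * 0.9                  # illustrative values
products = factor ** np.arange(1, 21)
```

Since each factor lies strictly inside the unit interval here, the product vanishes as the time gap grows, which is the scalar version of the vanishing-gradient problem.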

Further, we would like $r_t$ to be an ideal summary of $x_1, \ldots, x_t$. But in an RNN, $r_t$ depends mostly on the inputs close to $t$, so the RNN does not have a very long memory.