Consider the simplest linear regression model:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i. \tag{1.1}$$
Then consider a nonlinear model (a change-of-slope model):

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i - c)_+ + \epsilon_i. \tag{1.2}$$

We can also notate $(u)_+ = \max(u, 0)$ as $\mathrm{ReLU}(u)$. The unknown parameters here are $\beta_0, \beta_1, \beta_2$ and $c$. This is a linear model for the feature vector $(1, x_i, (x_i - c)_+)$.
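The feature-vector view suggests a simple fitting strategy. Below is a minimal NumPy sketch, with made-up parameter values and the knot $c$ assumed known (in general $c$ is itself an unknown parameter):

```python
import numpy as np

# Sketch: fit the change-of-slope model (1.2) by least squares, treating the
# knot c as known (a simplifying assumption; all numeric values are made up).
rng = np.random.default_rng(0)
relu = lambda u: np.maximum(u, 0.0)

c = 0.0                                   # assumed known knot location
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 2.0 * relu(x - c) + 0.1 * rng.standard_normal(200)

# (1.2) is linear in the feature vector (1, x_i, (x_i - c)_+)
X = np.column_stack([np.ones_like(x), x, relu(x - c)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to the true (beta0, beta1, beta2) = (1.0, 0.5, 2.0)
```

Once the ReLU feature is added to the design matrix, everything reduces to ordinary least squares.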
Rewrite (1.2): let $w = 1$, $b = -c$, and write $\mu_i$ for the mean of $y_i$. Also denote $h_i = \sigma(w x_i + b)$, where $\sigma(u) = \max(u, 0)$ is the ReLU function. Now (1.2) becomes

$$\mu_i = \beta_0 + \beta_1 x_i + \beta_2 h_i, \qquad h_i = \sigma(w x_i + b). \tag{1.3}$$
In words, $x_i$ is first converted to $z_i = w x_i + b$ by a linear function, then we apply $\sigma$ to generate $h_i = \sigma(z_i)$. Then $\mu_i$ is a linear function of $(x_i, h_i)$, which serves as the mean of $y_i$.
1.2 AR
Consider the AR(1) model:

$$y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t.$$

This is simply (1.1) with $x_i$ replaced by $y_{t-1}$.
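As a sanity check, the AR(1) model can be simulated and its coefficients recovered by least squares. A minimal sketch, with illustrative parameter values:

```python
import numpy as np

# Sketch: simulate y_t = beta0 + beta1 * y_{t-1} + eps_t and estimate
# (beta0, beta1) by regressing y_t on (1, y_{t-1}); all values are made up.
rng = np.random.default_rng(1)
beta0, beta1, T = 0.5, 0.8, 5000
y = np.zeros(T)
for t in range(1, T):
    y[t] = beta0 + beta1 * y[t - 1] + 0.3 * rng.standard_normal()

X = np.column_stack([np.ones(T - 1), y[:-1]])     # rows (1, y_{t-1})
b0_hat, b1_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(b0_hat, b1_hat)                             # near (0.5, 0.8)
```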
One can create a nonlinear version by using (1.3) with $x_t = y_{t-1}$. Call this (1.4):

$$\mu_t = \beta_0 + \beta_1 y_{t-1} + \beta_2\, \sigma(w y_{t-1} + b). \tag{1.4}$$
Now let's consider the AR($p$) model:

$$y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + \beta_p y_{t-p} + \epsilon_t. \tag{1.5}$$

Note that (1.5) can be written in the compressed form $y_t = \beta_0 + \beta^\top x_t + \epsilon_t$, where $\beta = (\beta_1, \ldots, \beta_p)^\top$ and $x_t = (y_{t-1}, \ldots, y_{t-p})^\top$.
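Constructing the lag vectors $x_t$ for the compressed form can be sketched as follows (a toy series and a made-up $p$, purely for illustration):

```python
import numpy as np

# Sketch: build the lag vectors x_t = (y_{t-1}, ..., y_{t-p}) that appear in
# the compressed form of AR(p); the toy series below is just for illustration.
y = np.arange(10, dtype=float)     # toy series y_0, ..., y_9
p = 3
# Row t - p of X holds x_t; the first usable target is y_p.
X = np.column_stack([y[p - j : len(y) - j] for j in range(1, p + 1)])
targets = y[p:]
print(X[0], targets[0])   # x_p = (y_2, y_1, y_0) = (2, 1, 0); target y_3 = 3
```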
Now we extend (1.5) to a nonlinear version like (1.4):

$$\mu_t = \beta_0 + \sum_{j=1}^{p} \beta_j\, \sigma(w_j y_{t-j} + b_j). \tag{1.6}$$
Here $y_{t-1}, \ldots, y_{t-p}$ denote the components of $x_t$. With this choice of $\mu_t$, note that $\mu_t = \beta_0 + \sum_{j=1}^{p} g_j(y_{t-j})$, where $g_j(u) = \beta_j\, \sigma(w_j u + b_j)$.
In other words, we are fitting an additive model for $\mu_t$ in terms of the covariates $y_{t-1}, \ldots, y_{t-p}$. However, an additive model cannot handle interactions between covariates.
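The additive structure can be checked numerically. In the sketch below (all parameter values made up), the direct evaluation of (1.6) agrees with the sum of the univariate functions $g_j$:

```python
import numpy as np

# Sketch of the additive structure of (1.6): the direct evaluation and the
# sum of univariate functions g_j agree; all parameter values are made up.
relu = lambda u: np.maximum(u, 0.0)
beta0 = 0.1
beta = np.array([0.5, -0.3])       # beta_1, beta_2
w = np.array([1.0, 2.0])           # w_1, w_2
b = np.array([0.0, -1.0])          # b_1, b_2

x_t = np.array([1.5, 0.8])         # (y_{t-1}, y_{t-2}), p = 2
mu_direct = beta0 + np.sum(beta * relu(w * x_t + b))

g = [lambda u, j=j: beta[j] * relu(w[j] * u + b[j]) for j in range(2)]
mu_additive = beta0 + sum(g[j](x_t[j]) for j in range(2))
print(mu_direct, mu_additive)      # equal: each lag contributes separately
```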
Instead of the above additive model, we shall use a model with a fully connected hidden layer:

$$\mu_t = \beta_0 + \beta^\top h_t, \qquad h_t = \sigma(W x_t + b). \tag{1.7}$$
Here $h_t \in \mathbb{R}^K$, $W \in \mathbb{R}^{K \times p}$, $b \in \mathbb{R}^K$, and $\sigma$ is applied componentwise. The parameters here are $\beta_0$, $\beta \in \mathbb{R}^K$, $W$, and $b$.
In neural-network terminology, (1.7) is called a single-hidden-layer neural network (see here).
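A forward pass of (1.7) can be sketched as follows, with illustrative shapes ($p = 3$ lags, $K = 4$ hidden units) and random parameter values:

```python
import numpy as np

# Sketch: one forward pass of the single-hidden-layer model (1.7),
# mu_t = beta0 + beta' h_t with h_t = sigma(W x_t + b).
# Shapes and values are illustrative: p = 3 lags, K = 4 hidden units.
rng = np.random.default_rng(2)
p, K = 3, 4
sigma = lambda u: np.maximum(u, 0.0)   # ReLU, applied componentwise

W = rng.standard_normal((K, p))
b = rng.standard_normal(K)
beta = rng.standard_normal(K)
beta0 = 0.2

x_t = rng.standard_normal(p)           # x_t = (y_{t-1}, ..., y_{t-p})
h_t = sigma(W @ x_t + b)               # hidden layer, a vector in R^K
mu_t = beta0 + beta @ h_t              # scalar mean of y_t
print(h_t.shape, float(mu_t))
```

Because $W$ mixes all $p$ lags before the nonlinearity, the hidden units can capture interactions that the additive model (1.6) cannot.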
2 RNNs
In (1.7), we make one modification:

$$\mu_t = \beta_0 + \beta^\top h_t, \qquad h_t = \sigma(W x_t + U h_{t-1} + b). \tag{2.1}$$
Here $h_t$ involves not just the current input $x_t$ but also the previous hidden-layer output $h_{t-1}$, through the matrix $U \in \mathbb{R}^{K \times K}$. The parameters now are $\beta_0$, $\beta$, $W$, $U$, and $b$. Typically $K$ is large. (2.1) also requires an initialization of $h_0$, usually $h_0 = 0$.
In (1.7), $h_t$ depends only on $x_t$. In (2.1), by contrast, $h_t$ depends on $x_t, x_{t-1}, \ldots$. Now we assume WLOG that the inputs $x_s$ are defined for all $s \le t$. To see how $h_t$ depends on them, note that

$$h_t = \sigma\bigl(W x_t + U\,\sigma(W x_{t-1} + U h_{t-2} + b) + b\bigr) = \cdots$$
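The recursion can also be seen numerically. In the sketch below (illustrative shapes and made-up values), perturbing the earliest input changes the final hidden state, confirming that $h_t$ depends on the whole input history:

```python
import numpy as np

# Sketch: iterating the RNN update (2.1) shows that h_t is a function of the
# whole input history: perturbing an early input changes h_T. Values made up.
rng = np.random.default_rng(3)
p, K, T = 1, 3, 6
W = 0.5 * rng.standard_normal((K, p))
U = 0.5 * rng.standard_normal((K, K))
b = np.zeros(K)
sigma = np.tanh

def run(x):
    h = np.zeros(K)                       # initialization h_0 = 0
    for t in range(1, T + 1):
        h = sigma(W @ x[t] + U @ h + b)   # h_t from x_t and h_{t-1}
    return h

x = rng.standard_normal((T + 1, p))
x_perturbed = x.copy()
x_perturbed[1] += 1.0                     # change only the earliest input x_1
diff = np.linalg.norm(run(x) - run(x_perturbed))
print(diff)                               # nonzero: h_T depends on x_1 too
```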
From the above, $h_t$ clearly depends on all of $x_t, x_{t-1}, \ldots$, but the strength of the dependence varies: earlier inputs enter only through repeated applications of $U$ and $\sigma$.
RNNs can have stability issues, causing either gradient explosion or gradient vanishing. To see this, observe that

$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{s=t-k+1}^{t} \mathrm{diag}\bigl(\sigma'(W x_s + U h_{s-1} + b)\bigr)\, U,$$

so the norm of this Jacobian can grow or shrink geometrically in $k$, depending on $\|U\|$.
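The geometric growth or decay can be illustrated numerically. The sketch below uses random stand-ins for the pre-activations, so it is only a caricature of the actual Jacobian:

```python
import numpy as np

# Sketch: dh_t/dh_{t-k} is a product of k factors diag(sigma'(a_s)) @ U.
# Scaling U up makes the product's norm explode with k; scaling it down
# makes it vanish. The pre-activations a_s are random stand-ins (made up).
rng = np.random.default_rng(4)
K, k = 8, 30
U = rng.standard_normal((K, K)) / np.sqrt(K)

def jacobian_norm(scale):
    J = np.eye(K)
    for _ in range(k):
        a = rng.standard_normal(K)               # stand-in pre-activations a_s
        D = np.diag(1.0 - np.tanh(a) ** 2)       # sigma'(a_s) for sigma = tanh
        J = D @ (scale * U) @ J
    return np.linalg.norm(J)

print(jacobian_norm(3.0), jacobian_norm(0.1))    # the first dwarfs the second
```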
To mitigate this, it is customary to take $\sigma$ to be the hyperbolic tangent function $\tanh$.
Unlike ReLU, this function takes values between $-1$ and $1$:

$$\tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}} \in (-1, 1).$$
Further, we would want $h_t$ to represent an ideal summary of the whole history $x_1, \ldots, x_t$. But here in an RNN, $h_t$ mainly depends on the inputs closer to time $t$. So an RNN does not have a very long memory.