Continuing with the last model, we consider the model obtained by plugging in all possible $s$'s, one indicator per time point:
$$y_t = \sum_{s=1}^{n} \beta_s \mathbf{1}\{t = s\} + \varepsilon_t, \qquad t = 1, \dots, n. \tag{1}$$
Note that it is different from our last model:
$$y_t = f(t; \theta) + \varepsilon_t. \tag{2}$$
The new model does not have parameters $\theta$ inside a nonlinear function, so it is linear (as discussed earlier). Also, (2) will be used with a small number of parameters, while (1) can have a large number. In short, (1) is a high-dimensional linear regression model, and (2) is a low-dimensional nonlinear regression model.
(1) has ample parameters, so it is a flexible model.
Let $x_t$ denote the deterministic part in (1) (without $\varepsilon_t$), i.e. $x_t = \beta_t$. Then (1) can be rewritten as
$$y_t = x_t + \varepsilon_t.$$
The noise term $\varepsilon_t$ can have several interpretations.
When we see $x_t$ as the trend, $\varepsilon_t$ can be seen as random fluctuations.
When we see $x_t$ as the actual data, $\varepsilon_t$ can be seen as measurement error.
Plugging in $t = 2$, we have $x_2 = x_1 + (x_2 - x_1)$. Plugging in $t = 3$, we have $x_3 = x_1 + 2(x_2 - x_1) + (x_3 - 2x_2 + x_1)$. Similarly,
$$x_t = x_1 + (t-1)(x_2 - x_1) + \sum_{s=2}^{t-1} (t - s)(x_{s+1} - 2x_s + x_{s-1}).$$
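The decomposition of a sequence into its level, initial slope, and second differences can be checked numerically. A minimal sketch with an arbitrary sequence (the numbers are illustrative):

```python
import numpy as np

# Any sequence is determined by its level x_1, initial slope x_2 - x_1,
# and second differences d_s = x_{s+1} - 2 x_s + x_{s-1}:
#   x_t = x_1 + (t-1)(x_2 - x_1) + sum_{s=2}^{t-1} (t - s) d_s
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])
n = len(x)
d = x[2:] - 2 * x[1:-1] + x[:-2]           # d[s-2] holds d_s for s = 2..n-1
for t in range(1, n + 1):                  # 1-based time index t
    rec = x[0] + (t - 1) * (x[1] - x[0])
    rec += sum((t - s) * d[s - 2] for s in range(2, t))
    assert np.isclose(rec, x[t - 1])       # reconstruction matches x_t
```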
If we use this model to represent the logarithm of population, $x_t = \log P_t$, then:
$x_{s+1} - 2x_s + x_{s-1}$ represents the difference between the percentage change from $s$ to $s+1$ and the percentage change from $s-1$ to $s$, for $s = 2, \dots, n-1$;
$x_1$ represents the scale of the data;
$x_2 - x_1$ represents the percentage change: letting $P_t$ denote the population, $x_2 - x_1 = \log(P_2 / P_1) \approx (P_2 - P_1)/P_1$.
So they have different scales.
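These scale differences can be seen numerically. A minimal sketch with a hypothetical population series growing about 2% per period (all numbers illustrative):

```python
import numpy as np

# Hypothetical population series: roughly 2% growth per period plus tiny noise.
rng = np.random.default_rng(0)
P = 100.0 * 1.02 ** np.arange(10) * np.exp(rng.normal(0.0, 0.001, 10))

x = np.log(P)          # x_t = log P_t: on the scale of the data (~4.6 here)
d1 = np.diff(x)        # x_{t+1} - x_t: approximately the percentage change (~0.02)
d2 = np.diff(x, n=2)   # second difference: change in the percentage change (tiny)
```

The level, the first differences, and the second differences live on very different scales, which is why penalizing them uniformly would not make sense.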
2 Parameter Estimation
2.1 Unregularized MLE
For the linear regression model (1), we can estimate $x = (x_1, \dots, x_n)$ as usual by MLE (least squares under Gaussian noise):
$$\hat{x} = \arg\min_{x} \sum_{t=1}^{n} (y_t - x_t)^2.$$
We have $\hat{x}_t = y_t$.
However, now the number of data points is equal to the number of coefficients, so the fit is exact. The unbiased estimate of $\sigma^2$ will not exist because $\sum_{t} (y_t - \hat{x}_t)^2 = 0$ and the degrees of freedom $n - p = 0$.
So the traditional estimates overfit the data.
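The overfitting is easy to see in code: with one coefficient per observation the design matrix is the identity, and least squares reproduces the data exactly. A minimal sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
trend = np.linspace(0.0, 1.0, n)           # hypothetical smooth trend
y = trend + rng.normal(0.0, 0.1, n)        # observed data with noise

# In model (1) every time point gets its own coefficient, so the design
# matrix is the identity and least squares reproduces y exactly.
X = np.eye(n)
x_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Every residual is zero: the estimate absorbs the noise along with the trend.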
2.2 Regularization
Now we add regularization. The ridge estimator $\hat{\beta}^{\text{ridge}}$ is the minimizer of
$$\|y - X\beta\|^2 + \lambda \|\beta\|^2.$$
We also have the LASSO estimator $\hat{\beta}^{\text{lasso}}$, the minimizer of
$$\|y - X\beta\|^2 + \lambda \|\beta\|_1.$$
Correspondingly, plugging in the second differences $x_{t+1} - 2x_t + x_{t-1}$ as the penalized coefficients, ridge goes to the Hodrick-Prescott filter:
$$\min_{x} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} (x_{t+1} - 2x_t + x_{t-1})^2,$$
and LASSO goes to the $\ell_1$-trend filter:
$$\min_{x} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} |x_{t+1} - 2x_t + x_{t-1}|.$$
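The Hodrick-Prescott objective is quadratic in $x$, so its minimizer has a closed form $\hat{x} = (I + \lambda D^\top D)^{-1} y$, where $D$ is the $(n-2) \times n$ second-difference matrix. A minimal numpy sketch on hypothetical data (the series and the value of $\lambda$ are illustrative):

```python
import numpy as np

def hp_filter(y, lam):
    """Hodrick-Prescott filter: minimize sum (y_t - x_t)^2 + lam * sum (2nd diffs)^2.
    The objective is quadratic, so x_hat = (I + lam * D'D)^{-1} y."""
    n = len(y)
    # D: (n-2) x n second-difference matrix, (Dx)_i = x_{i+2} - 2 x_{i+1} + x_i
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)

rng = np.random.default_rng(2)
t = np.arange(50)
y = 0.05 * t + rng.normal(0.0, 0.5, 50)    # hypothetical noisy linear trend
x_hat = hp_filter(y, lam=1000.0)           # large lam -> nearly linear trend
```

With $\lambda = 0$ the filter returns the data itself (the unregularized fit), and as $\lambda$ grows the fitted trend's second differences are driven toward zero.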
Fact (Simple Ridge)
For $f(x) = (y - x)^2 + \lambda x^2$, we can easily find the minimizer $\hat{x} = \dfrac{y}{1 + \lambda}$.
We can rewrite this in matrix form: take $f(\beta) = \|y - X\beta\|^2 + \lambda \|\beta\|^2$; then
$$\nabla f(\beta) = -2 X^\top (y - X\beta) + 2\lambda \beta.$$
Setting it to $0$, we have
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$$
Compared with the regular OLS estimator $(X^\top X)^{-1} X^\top y$, the only difference is the $\lambda I$ term.
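The ridge closed form is straightforward to compute with numpy. A sketch with a hypothetical design matrix and coefficients (setting $\lambda = 0$ recovers OLS):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                      # hypothetical design matrix
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])  # hypothetical coefficients
y = X @ beta_true + rng.normal(0.0, 0.1, 30)

lam = 2.0
p = X.shape[1]
# Ridge: (X'X + lam*I)^{-1} X'y; with lam = 0 this is exactly OLS.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The ridge estimate is a shrunken version of OLS: its norm decreases monotonically as $\lambda$ grows.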
For LASSO, there is no such closed-form matrix solution in general.
Fact (Simple LASSO)
The minimizer of $f(x) = (y - x)^2 + \lambda |x|$ is given by the soft-thresholding rule
$$\hat{x} = \operatorname{sign}(y) \max\left(|y| - \frac{\lambda}{2},\, 0\right).$$
Proof
For $x \neq 0$, the derivative of $f$ is given by $f'(x) = -2(y - x) + \lambda \operatorname{sign}(x)$. At $x = 0$, the function is not differentiable. Setting the derivative to zero for $x > 0$, we have $\hat{x} = y - \lambda/2$; for this to hold, we have to assume $y > \lambda/2$.
Similarly, for $x < 0$ we get $\hat{x} = y + \lambda/2$, and we have to assume $y < -\lambda/2$.
In the intermediate range $-\lambda/2 \le y \le \lambda/2$, check that $f'(x) > 0$ for $x > 0$ and $f'(x) < 0$ for $x < 0$, so the minimizer is $\hat{x} = 0$.
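The soft-thresholding rule can be verified numerically by brute force over a fine grid (the test values of $y$ and $\lambda$ are arbitrary):

```python
import numpy as np

def soft_threshold(y, lam):
    """Minimizer of f(x) = (y - x)^2 + lam * |x|: soft thresholding at lam/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

# Brute-force check: the grid minimizer of f agrees with the formula.
lam = 1.0
for y0 in [-2.0, -0.4, 0.0, 0.4, 2.0]:
    grid = np.linspace(-3.0, 3.0, 120001)
    f = (y0 - grid) ** 2 + lam * np.abs(grid)
    assert abs(grid[np.argmin(f)] - soft_threshold(y0, lam)) < 1e-3
```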
3 Cross Validation for Selecting $\lambda$
We can use cross-validation to select $\lambda$. First split the total index set into a training set $T$ and a test set $V$ (say 80% : 20%). For this split, fit the model on $T$: obtain $\hat{x}$ as the minimizer of the penalized objective restricted to $t \in T$, and $\hat{\sigma}^2$ as the average squared residual on $T$.
Using these estimates, predict $\hat{y}_t$ for each $t \in V$.
Denote the test error of this split by
$$\mathrm{Err}(T, V) = \sum_{t \in V} (y_t - \hat{y}_t)^2.$$
Going over all splits, we can compute the total test error:
$$\mathrm{Err}(\lambda) = \sum_{\text{splits}} \mathrm{Err}(T, V).$$
We apply this to a set of candidate $\lambda$ values and choose the optimal $\lambda$ that minimizes the all-split test error.
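The selection loop can be sketched as follows, here using ridge regression as the fitted model so that prediction on held-out points is immediate; the data, candidate grid, and number of splits are all illustrative assumptions, and the same loop applies to the filtering objectives:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))                      # hypothetical design matrix
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(0.0, 1.0, n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]          # candidate values (illustrative)
splits = [rng.permutation(n) for _ in range(5)]  # five random 80% / 20% splits

errors = []
for lam in lambdas:
    total = 0.0
    for perm in splits:                          # same splits for every lambda
        train, test = perm[:80], perm[80:]
        b = ridge_fit(X[train], y[train], lam)
        total += np.sum((y[test] - X[test] @ b) ** 2)
    errors.append(total)

best_lam = lambdas[int(np.argmin(errors))]       # minimizer of all-split test error
```

Reusing the same splits across all candidate $\lambda$'s keeps the comparison fair: each candidate is scored on identical held-out data.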