1 Simple Linear Regression

1 Setting

Setting: $(x_1,y_1),\ldots,(x_n,y_n)$.
For a time-series setting, for example, we may take $x_i = i$ as the time index and $y_i$ as the population at time $i$.
For a linear model, we assume $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term, and assume $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$.
The parameters here are $\beta_0, \beta_1, \sigma$ ($\sigma$ measures the noise scale). We want to do estimation and uncertainty quantification.
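As a minimal sketch of this model, we can simulate a dataset from it; the sample size and parameter values below ($n$, $\beta_0$, $\beta_1$, $\sigma$) are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) true parameters
n = 50
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.arange(1, n + 1)               # time-series index: x_i = i
eps = rng.normal(0.0, sigma, size=n)  # eps_i i.i.d. N(0, sigma^2)
y = beta0 + beta1 * x + eps           # y_i = beta0 + beta1 * x_i + eps_i
```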

2 Frequentist Inference

The frequentist approach is the MLE. Under the normal model, the likelihood of $(\beta_0,\beta_1,\sigma)$ is $(2\pi)^{-\frac n2}\sigma^{-n}\exp\big[-\frac{S(\beta_0,\beta_1)}{2\sigma^2}\big]$, where $S(\beta_0,\beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$ is the sum of squared residuals. Maximizing over $(\beta_0,\beta_1)$ is equivalent to minimizing $S(\beta_0,\beta_1)$, so the MLE $(\hat\beta_0,\hat\beta_1)$ is the least-squares estimate; plugging it back in gives $\hat\sigma^2 = S(\hat\beta_0,\hat\beta_1)/n$.
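The least-squares estimates have a standard closed form, which a short sketch can compute directly (the helper name `least_squares` is ours, not from the text):

```python
import numpy as np

def least_squares(x, y):
    """Closed-form least-squares / MLE estimates for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    resid = y - beta0 - beta1 * x
    sigma2 = np.sum(resid ** 2) / len(x)  # MLE of sigma^2: S(b0_hat, b1_hat) / n
    return beta0, beta1, sigma2

# Sanity check on an exact line y = 1 + 2x (zero residuals)
b0, b1, s2 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```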


3 Bayesian Inference

The Bayesian approach involves the posterior density
$$f_{\beta_0,\beta_1,\sigma \mid (x_i,y_i)}(\beta_0,\beta_1,\sigma) \propto \underbrace{f_{(x_i,y_i) \mid (\beta_0,\beta_1,\sigma)}\big((x_i,y_i)\big)}_{\text{likelihood}} \cdot \underbrace{f_{\beta_0,\beta_1,\sigma}(\beta_0,\beta_1,\sigma)}_{\text{prior}}.$$
The additional information we need to provide is the prior. We can assume $\beta_0, \beta_1, \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}[-C, C]$.
By the transformation formula, $f_\sigma(\sigma) = f_{\log\sigma}(\log\sigma)\cdot\frac{1}{\sigma}$. So the prior density is
$$f_{\beta_0,\beta_1,\sigma}(\beta_0,\beta_1,\sigma) = f_{\beta_0}(\beta_0)\, f_{\beta_1}(\beta_1)\, f_\sigma(\sigma) = \frac{I\{-C<\beta_0<C\}}{2C}\cdot\frac{I\{-C<\beta_1<C\}}{2C}\cdot\frac{I\{-C<\log\sigma<C\}}{2C\sigma} \propto \frac{1}{\sigma}\, I\{-C<\beta_0,\beta_1,\log\sigma<C\}.$$
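The transformation formula can be checked numerically: if $\log\sigma \sim \mathrm{Uniform}[-C,C]$, then $P(a<\sigma<b) = \frac{\log b - \log a}{2C}$, which is exactly what integrating the density $\frac{1}{2C\sigma}$ gives. A Monte Carlo sketch (the interval $(a,b)$ and $C$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 2.0
# Draw log(sigma) uniformly, then exponentiate
sigma = np.exp(rng.uniform(-C, C, size=200_000))

# Compare empirical probability of an interval with the transformed density
a, b = 0.5, 0.6
p_emp = np.mean((sigma > a) & (sigma < b))
p_theory = (np.log(b) - np.log(a)) / (2 * C)  # integral of 1/(2C*sigma) over (a, b)
```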
So
$$f_{\beta_0,\beta_1,\sigma\mid\text{data}}(\beta_0,\beta_1,\sigma) \propto (2\pi)^{-\frac n2}\sigma^{-n}\exp\Big[-\frac{S(\beta_0,\beta_1)}{2\sigma^2}\Big]\cdot\frac{1}{\sigma}\, I\{-C<\beta_0,\beta_1,\log\sigma<C\} \propto \sigma^{-n-1}\exp\Big[-\frac{S(\beta_0,\beta_1)}{2\sigma^2}\Big]\, I\{-C<\beta_0,\beta_1,\log\sigma<C\}.$$
If we want to get the posterior for $\beta_0,\beta_1 \mid \text{data}$, we integrate over $\sigma$:
$$f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) = \int f_{\beta_0,\beta_1,\sigma\mid\text{data}}(\beta_0,\beta_1,\sigma)\, d\sigma \propto 1\{-C<\beta_0,\beta_1<C\} \int_{e^{-C}}^{e^{C}} \sigma^{-n-1}\exp\Big(-\frac{S(\beta_0,\beta_1)}{2\sigma^2}\Big)\, d\sigma.$$
When $C$ is large, $(e^{-C}, e^{C})$ goes to $(0,\infty)$. Using the change of variable $s = \sigma/\sqrt{S(\beta_0,\beta_1)}$, the integral becomes
$$\int_0^\infty \sigma^{-n-1}\exp\Big(-\frac{S(\beta_0,\beta_1)}{2\sigma^2}\Big)\, d\sigma = S(\beta_0,\beta_1)^{-\frac n2}\int_0^\infty s^{-n-1}\exp\Big(-\frac{1}{2s^2}\Big)\, ds \propto S(\beta_0,\beta_1)^{-\frac n2},$$
and
$$f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto 1\{-C<\beta_0,\beta_1<C\}\, S(\beta_0,\beta_1)^{-\frac n2}. \tag{3.1}$$
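The scaling $\int_0^\infty \sigma^{-n-1} e^{-S/(2\sigma^2)}\, d\sigma \propto S^{-n/2}$ can be verified numerically. The sketch below approximates the integral by a trapezoid rule on a truncated grid (the choice $n=5$ and the truncation points are arbitrary) and checks the ratio at two values of $S$:

```python
import numpy as np

def marginal_integral(S, n=5):
    """Trapezoid approximation of I(S) = int_0^inf sigma^{-(n+1)} exp(-S/(2 sigma^2)) d sigma.

    The integrand vanishes rapidly at both ends, so truncating to
    [1e-3, 50] loses a negligible amount of mass for these S, n.
    """
    sig = np.linspace(1e-3, 50.0, 200_001)
    vals = sig ** (-(n + 1)) * np.exp(-S / (2 * sig ** 2))
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(sig))

# I(S) is proportional to S^{-n/2}: with n = 5, I(4)/I(1) should be 4^{-5/2} = 1/32
ratio = marginal_integral(4.0) / marginal_integral(1.0)
```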
Since $S(\beta_0,\beta_1)$ in practical problems can be extremely large, raising it to the power $-\frac n2$ causes numerical issues; to handle this, we rewrite (3.1) as
$$f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto \Big(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\Big)^{\frac n2}\, 1\{-C<\beta_0,\beta_1<C\}. \tag{3.2}$$
The density will be concentrated around values of $(\beta_0,\beta_1)$ such that $S(\beta_0,\beta_1)$ is close to $S(\hat\beta_0,\hat\beta_1)$.
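A sketch of evaluating this ratio form of the posterior on a grid follows; the simulated data, grid ranges, and the use of the grid minimum of $S$ in place of $S(\hat\beta_0,\hat\beta_1)$ are all our illustrative choices:

```python
import numpy as np

# Simulated data (made-up parameters: beta0 = 2, beta1 = 0.5, sigma = 1)
rng = np.random.default_rng(0)
n = 30
x = np.arange(1, n + 1)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Grid over (beta0, beta1); ranges chosen to cover the truth
b0_grid = np.linspace(0.0, 4.0, 80)
b1_grid = np.linspace(0.3, 0.7, 80)
B0, B1 = np.meshgrid(b0_grid, b1_grid, indexing="ij")

# S(beta0, beta1) at every grid point via broadcasting: shape (80, 80, 30) -> (80, 80)
resid = y - B0[..., None] - B1[..., None] * x
S = np.sum(resid ** 2, axis=-1)

# Ratio form in log space: S.min() over the grid stands in for S(b0_hat, b1_hat),
# so the ratio is O(1) and the power n/2 cannot overflow
log_post = -(n / 2) * np.log(S / S.min())
post = np.exp(log_post)
post /= post.sum()  # normalize over the grid
```

Note the design choice: the computation stays in log space until the final exponentiation, which is exactly what makes the ratio rewrite (3.2) numerically safe when $n$ is large.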