3.2 Score, Fisher Information

1 Score Function

As a motivation, we consider the exponential family
$$p(x;\theta) = e^{\eta(\theta)^T T(x) - A(\eta(\theta))}\, h(x), \qquad \eta: \mathbb{R} \to \mathbb{R}^2, \quad \Xi = \{\eta(\theta) \mid \theta \in \mathbb{R}\}.$$
For this family, if $\eta(\theta)$ is non-linear, then $T(X)$ is minimal. The tangent vector at $\theta_0$ is $\dot\eta(\theta_0) = \frac{d\eta}{d\theta}(\theta_0)$. So fixing $\theta_0$, we define the tangent family $\tilde\Xi = \{\eta(\theta_0) + \varepsilon\,\dot\eta(\theta_0) \mid \varepsilon \in \mathbb{R}\}$, whose density becomes
$$q_\varepsilon(x) = e^{(\eta(\theta_0) + \varepsilon\dot\eta(\theta_0))^T T(x) - A(\eta(\theta_0) + \varepsilon\dot\eta(\theta_0))}\, h(x) = e^{\varepsilon\,\dot\eta(\theta_0)^T (T(x) - E_{\theta_0} T(x)) - B(\varepsilon)}\, k(x).$$
Here, denoting $S_{\theta_0}(x) = \dot\eta(\theta_0)^T (T(x) - E_{\theta_0} T(x))$, we see that $S_{\theta_0}$ is complete sufficient for the tangent family at $\theta_0$. We derive the score function from here.

Assume $\mathcal{P}$ has densities $p_\theta$ w.r.t. $\mu$, with $\Theta \subseteq \mathbb{R}^d$. The common support $\{x \mid p_\theta(x) > 0\}$ is the same for all $\theta$. Recall $l(\theta; x) = \log p_\theta(x)$.

Score Function

The score function is $\nabla_\theta\, l(\theta; x)$.

It plays a key role in many areas of statistics, especially in asymptotics.

We can think of the score as a "local complete sufficient statistic", i.e.
$$p_{\theta_0 + \eta}(x) = e^{l(\theta_0 + \eta;\, x) - l(\theta_0;\, x)}\, p_{\theta_0}(x) \approx e^{\eta^T \nabla l(\theta_0;\, x)}\, p_{\theta_0}(x), \qquad \eta \to 0.$$
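A minimal numeric sketch of this local approximation, using the $N(\theta, 1)$ location family (an assumed example, not from the notes): there the score is $l'(\theta; x) = x - \theta$, and for small $\eta$ the log density ratio is approximately $\eta$ times the score.

```python
import math

def log_density(theta, x):
    # log of the N(theta, 1) density
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

theta0, x, eta = 1.0, 2.5, 1e-3
score = x - theta0                                    # l'(theta0; x) for N(theta, 1)
exact = log_density(theta0 + eta, x) - log_density(theta0, x)
approx = eta * score                                  # first-order (score) term
print(exact, approx)                                  # agree up to O(eta^2)
```

The gap between the two quantities is of order $\eta^2$, consistent with the first-order expansion above.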

2 Differential Identities and Fisher Information

We assume, as usual, enough regularity so that differentiation and integration are exchangeable.
Since $1 = \int e^{l(\theta; x)}\, d\mu(x)$, we take the partial derivative w.r.t. $\theta_j$:
$$0 = \int \frac{\partial l}{\partial \theta_j}(\theta)\, e^{l(\theta)}\, d\mu = E_\theta\!\left[\frac{\partial l}{\partial \theta_j}(\theta; X)\right], \tag{1.1}$$
i.e. $E_\theta[\nabla l(\theta; X)] = 0$ (note this only holds when the expectation is taken at the same $\theta$). Then take the partial derivative w.r.t. $\theta_k$:
$$0 = \int \left(\frac{\partial^2 l}{\partial\theta_j \partial\theta_k}(\theta) + \frac{\partial l}{\partial\theta_j}\frac{\partial l}{\partial\theta_k}(\theta)\right) e^{l(\theta)}\, d\mu = E_\theta\!\left[\frac{\partial^2 l}{\partial\theta_j \partial\theta_k}\right] + E_\theta\!\left[\frac{\partial l}{\partial\theta_j}\frac{\partial l}{\partial\theta_k}\right].$$
By (1.1), the second term here is the covariance of the score components, which leads to
$$\operatorname{Var}_\theta[\nabla l(\theta; X)] = -E_\theta[\nabla^2 l(\theta; X)] = J(\theta). \tag{1.2}$$

Fisher Information

Define $J(\theta) = \operatorname{Var}_\theta[\nabla l(\theta; X)] = -E_\theta[\nabla^2 l(\theta; X)]$ as the Fisher information.

It is possible to extend this definition to certain cases where $l$ is not even differentiable, such as the Laplace location family, but for our purposes we can just assume "sufficient regularity".
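The identity (1.2) can be checked numerically. As an assumed example (not from the notes), take the Poisson($\theta$) family, where $l(\theta; x) = x\log\theta - \theta - \log x!$, so the score is $x/\theta - 1$ and the second derivative is $-x/\theta^2$; both sides of (1.2) equal $1/\theta$. Summing over the effectively finite support:

```python
import math

theta = 3.0

def pmf(x):
    # Poisson(theta) probability mass function
    return math.exp(-theta) * theta ** x / math.factorial(x)

xs = range(0, 60)                         # tail mass beyond 60 is negligible
score = lambda x: x / theta - 1.0         # d/dtheta of x*log(theta) - theta - log(x!)
hess = lambda x: -x / theta ** 2          # second derivative of the log-likelihood

mean_score = sum(score(x) * pmf(x) for x in xs)           # should be ~0, as in (1.1)
var_score = sum(score(x) ** 2 * pmf(x) for x in xs)       # Var of score (mean is 0)
neg_mean_hess = -sum(hess(x) * pmf(x) for x in xs)        # -E[l''], as in (1.2)
print(mean_score, var_score, neg_mean_hess)               # ~0, ~1/theta, ~1/theta
```

Both expressions for $J(\theta)$ agree, and the score integrates to zero, matching (1.1) and (1.2).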

Now we try with another statistic $\delta(X)$. Let $g(\theta) = E_\theta[\delta(X)] = \int \delta\, e^{l(\theta)}\, d\mu$ (i.e. $\delta$ is an unbiased estimator of $g(\theta)$); then
$$\nabla g(\theta) = \int \delta\, \nabla l\, e^{l}\, d\mu = E_\theta[\delta(X)\, \nabla l(\theta; X)] = \operatorname{Cov}_\theta(\delta(X),\, \nabla l(\theta; X)),$$
since $E_\theta[\nabla l] = 0$.
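This covariance identity can also be verified numerically. Continuing the assumed Poisson($\theta$) example with $\delta(X) = X$: here $g(\theta) = E_\theta[X] = \theta$, so $g'(\theta) = 1$, which should equal $\operatorname{Cov}_\theta(X, \dot l(\theta; X))$.

```python
import math

theta = 2.0
pmf = lambda x: math.exp(-theta) * theta ** x / math.factorial(x)
xs = range(0, 60)                          # effectively the full support

score = lambda x: x / theta - 1.0          # Poisson score
# Cov(delta, score) = E[delta * score], since E[score] = 0
cov = sum(x * score(x) * pmf(x) for x in xs)
print(cov)                                 # ~1 = g'(theta)
```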

3 CRLB

Now we combine these results with the Cauchy–Schwarz inequality. For any unbiased estimator $\delta(X)$ of $g(\theta)$,
$$\operatorname{Var}_\theta(\delta(X)) \ge \nabla g(\theta)^T J(\theta)^{-1}\, \nabla g(\theta). \tag{2.1}$$

Interpretation: if $g(\theta)$ is the estimand, no unbiased estimator can have variance smaller than $\nabla g(\theta)^T J(\theta)^{-1} \nabla g(\theta)$.

The resulting lower bound (2.1) is called the Cramér–Rao Lower Bound (CRLB).
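As an assumed illustration (not from the notes), the CRLB is attained by $\delta(X) = X$ for a single Poisson($\theta$) observation: $g(\theta) = \theta$, $J(\theta) = 1/\theta$, so the bound is $1/J(\theta) = \theta$, which is exactly $\operatorname{Var}_\theta(X)$.

```python
import math

theta = 2.5
pmf = lambda x: math.exp(-theta) * theta ** x / math.factorial(x)
xs = range(0, 80)                                     # effectively the full support

var_delta = sum((x - theta) ** 2 * pmf(x) for x in xs)        # Var(X)
J = sum((x / theta - 1.0) ** 2 * pmf(x) for x in xs)          # Var of the score
print(var_delta, 1.0 / J)                                     # both ~theta: bound attained
```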

4 Efficiency

The CRLB is not necessarily attainable. We define the efficiency of an unbiased estimator as
$$\operatorname{eff}_\theta(\delta) = \frac{\text{CRLB}}{\operatorname{Var}_\theta(\delta)} \le 1.$$
If $g(\theta) = \theta$, by (2.1), $\text{CRLB} = \frac{1}{J(\theta)}$, so $\operatorname{eff}_\theta(\delta) = \frac{1}{J(\theta)\operatorname{Var}_\theta(\delta)}$.
We say $\delta(X)$ is efficient if $\operatorname{eff}_\theta(\delta) = 1$ for all $\theta$.
Plugging in the CRLB's expression,
$$\operatorname{eff}_\theta(\delta(X)) = \frac{\operatorname{Cov}_\theta^2(\delta(X),\, \dot l(\theta; X))}{\operatorname{Var}_\theta(\delta(X))\,\operatorname{Var}_\theta(\dot l(\theta))} = \operatorname{Corr}^2_\theta(\delta(X),\, \dot l(\theta; X)).$$
So $\delta(X)$ is efficient iff $\operatorname{Corr}_\theta(\delta(X),\, \dot l(\theta; X)) = 1$ for all $\theta$.
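A small sketch of this equivalence with an assumed inefficient estimator (not from the notes): take two i.i.d. $N(\theta, 1)$ observations and $\delta = w_1 X_1 + w_2 X_2$ with $w_1 + w_2 = 1$, $w_1 \ne w_2$. All moments are available in closed form, so the efficiency ratio and the squared correlation with the score can be compared directly.

```python
# Two i.i.d. N(theta, 1) observations; delta = w1*X1 + w2*X2 is unbiased for theta.
w1, w2 = 0.7, 0.3
var_delta = w1 ** 2 + w2 ** 2        # Var(delta) under unit-variance normals
J = 2.0                              # Fisher information of two N(theta, 1) samples
crlb = 1.0 / J
eff_ratio = crlb / var_delta         # CRLB / Var(delta)

# Score of the joint sample: (X1 - theta) + (X2 - theta), with
# Cov(delta, score) = w1 + w2 = 1 and Var(score) = J = 2.
corr_sq = (w1 + w2) ** 2 / (var_delta * J)
print(eff_ratio, corr_sq)            # equal; both < 1 since w1 != w2
```

The two expressions coincide, and both fall strictly below 1 because the unequal weights make $\delta$ imperfectly correlated with the score.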

This is rarely achieved in finite samples, but we can approach it asymptotically as $n \to \infty$.

5 Fisher Information as Local Metric

Recall that the KL divergence $D_{KL}(p\,\|\,q) = E_p[\log p(X) - \log q(X)] = \int \log\!\left(\frac{p}{q}\right) p\, d\mu$ is used to describe the distance between two distributions.
For a parametric model, $D_{KL}(\theta\,\|\,\theta') = D_{KL}(p_\theta\,\|\,p_{\theta'}) = \int (l(\theta) - l(\theta'))\, e^{l(\theta)}\, d\mu$. Since
$$\frac{\partial}{\partial \theta'_j} D_{KL}(\theta\,\|\,\theta') = -\int \frac{\partial l}{\partial \theta'_j}(\theta')\, e^{l(\theta)}\, d\mu = 0 \quad \text{at } \theta' = \theta,$$
$$\frac{\partial^2}{\partial \theta'_j \partial \theta'_k} D_{KL}(\theta\,\|\,\theta') = -\int \frac{\partial^2 l}{\partial \theta'_j \partial \theta'_k}(\theta')\, e^{l(\theta)}\, d\mu = J(\theta)_{jk} \quad \text{at } \theta' = \theta,$$
and $J(\theta)$ is positive definite, the KL divergence is minimized at $\theta' = \theta$.

We can use a Taylor expansion to show that $D_{KL}(\theta\,\|\,\theta') \approx \frac{1}{2}(\theta' - \theta)^T J(\theta)\,(\theta' - \theta)$ as $\theta' \to \theta$.
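A numeric sketch of this quadratic approximation, using the assumed Poisson example again: $D_{KL}(\text{Poi}(\theta)\,\|\,\text{Poi}(\theta')) = \theta\log(\theta/\theta') + \theta' - \theta$ in closed form, and $J(\theta) = 1/\theta$, so the local approximation is $(\theta' - \theta)^2 / (2\theta)$.

```python
import math

theta, d = 2.0, 1e-3
theta_p = theta + d

# Closed-form KL divergence between Poisson(theta) and Poisson(theta')
kl = theta * math.log(theta / theta_p) + theta_p - theta
# Quadratic approximation (theta' - theta)^2 * J(theta) / 2, with J = 1/theta
quad = d ** 2 / (2 * theta)
print(kl, quad)                      # agree up to O(d^3)
```

The discrepancy is of third order in $\theta' - \theta$, confirming that the Fisher information acts as the local metric.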