As a motivation, we consider the exponential family with densities
$$p_\eta(x) = e^{\eta^\top T(x) - A(\eta)}\, h(x), \qquad \eta \in \Xi.$$
For this (full-rank) family,
$T(X)$ is complete sufficient.
$T(X)$ is minimal sufficient.
$E_\eta[T(X)] = \nabla A(\eta)$.
$\mathrm{Var}_\eta(T(X)) = \nabla^2 A(\eta)$.
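As a quick check of these facts, here is a standard worked instance (the Poisson model is my own illustration, not from the notes). For $X \sim \mathrm{Poisson}(\lambda)$ with $\eta = \log\lambda$,
$$p_\eta(x) = e^{\eta x - e^\eta}\,\frac{1}{x!}, \qquad T(x) = x, \quad A(\eta) = e^\eta,$$
so $E_\eta[X] = A'(\eta) = e^\eta = \lambda$ and $\mathrm{Var}_\eta(X) = A''(\eta) = e^\eta = \lambda$, matching the familiar Poisson moments.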
Now consider a curved sub-family $\eta = \eta(\theta)$, $\theta \in \Theta$. If $\eta(\cdot)$ is non-linear, then $T(X)$ is still minimal sufficient (but in general not complete). The tangent vector at $\theta_0$ is $\dot\eta(\theta_0)$. So, fixing $\theta_0$, we define the tangent family $\{p_{\eta(\theta_0) + t\,\dot\eta(\theta_0)} : t\}$, whose density becomes
$$p_t(x) = e^{(\eta(\theta_0) + t\,\dot\eta(\theta_0))^\top T(x) - A(\eta(\theta_0) + t\,\dot\eta(\theta_0))}\, h(x).$$
This is a one-parameter (full-rank) exponential family in $t$, so $\dot\eta(\theta_0)^\top T(X)$ is complete sufficient for the tangent family at $\theta_0$. We derive the score function from here.
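As a concrete instance (an illustrative choice, not from the notes), take the curved Gaussian family $N(\theta, \theta^2)$, $\theta > 0$. Here
$$\eta(\theta) = \Big(\tfrac{1}{\theta},\ -\tfrac{1}{2\theta^2}\Big), \qquad T(x) = (x,\, x^2),$$
so the parameter traces a one-dimensional curve in the two-dimensional natural parameter space, and the tangent direction at $\theta_0$ is $\dot\eta(\theta_0) = \big(-\tfrac{1}{\theta_0^2},\ \tfrac{1}{\theta_0^3}\big)$.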
Assume $\{P_\theta : \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}^d$, has densities $p_\theta$ w.r.t. a common measure $\mu$, and that the support $\{x : p_\theta(x) > 0\}$ is the same for all $\theta$. Recall the log-likelihood $\ell_\theta(x) = \log p_\theta(x)$.
Score Function
The score function is $\dot\ell_\theta(x) = \nabla_\theta \log p_\theta(x)$.
It plays a key role in many areas of statistics, especially in asymptotics.
We can think of the score $\dot\ell_\theta(X)$ as a "local complete sufficient statistic": at each fixed $\theta$ it is (up to centering) the complete sufficient statistic $\dot\eta(\theta)^\top T(X)$ of the tangent family; see the curved-family example below.
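For instance (a standard illustration, not a family singled out in the notes), for the Gaussian location model $N(\theta, 1)$,
$$\ell_\theta(x) = -\tfrac12 (x - \theta)^2 - \tfrac12 \log(2\pi), \qquad \dot\ell_\theta(x) = x - \theta,$$
so the score is simply the centered observation.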
2 Differential Identities and Fisher Information
We assume, as usual, enough regularity so that differentiation and integration can be interchanged.
Since $\int p_\theta(x)\, d\mu(x) = 1$, we take the partial derivative w.r.t. $\theta$:
$$0 = \int \nabla_\theta\, p_\theta(x)\, d\mu(x) = \int \dot\ell_\theta(x)\, p_\theta(x)\, d\mu(x) = E_\theta\big[\dot\ell_\theta(X)\big]$$
(this is only true if the support is the same for all $\theta$). Then take the partial derivative w.r.t. $\theta$ again:
$$0 = \int \big(\ddot\ell_\theta(x) + \dot\ell_\theta(x)\dot\ell_\theta(x)^\top\big)\, p_\theta(x)\, d\mu(x) = E_\theta\big[\ddot\ell_\theta(X)\big] + E_\theta\big[\dot\ell_\theta(X)\dot\ell_\theta(X)^\top\big].$$
The second term here is the covariance of the score (its mean is zero), which leads to
$$\mathrm{Var}_\theta\big(\dot\ell_\theta(X)\big) = E_\theta\big[\dot\ell_\theta(X)\dot\ell_\theta(X)^\top\big] = -E_\theta\big[\ddot\ell_\theta(X)\big].$$
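Here is a minimal numerical sketch of these two identities (my own illustration, assuming an Exponential($\theta$) model in the rate parameterization, where $\dot\ell_\theta(x) = 1/\theta - x$ and $\ddot\ell_\theta(x) = -1/\theta^2$):

```python
import numpy as np

# Monte Carlo check of the differential identities for an Exponential(theta)
# model (rate parameterization): score = 1/theta - x, hessian = -1/theta^2.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=1_000_000)

score = 1.0 / theta - x      # \dot\ell_\theta(x), one value per sample
hess = -1.0 / theta**2       # \ddot\ell_\theta(x), non-random for this model

print("E[score]    ~", score.mean())   # should be close to 0
print("Var(score)  ~", score.var())    # should be close to 1/theta^2 = 0.25
print("-E[hessian] =", -hess)          # exactly 1/theta^2 = 0.25
```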
Fisher Information
Define
$$I(\theta) := \mathrm{Var}_\theta\big(\dot\ell_\theta(X)\big) = E_\theta\big[\dot\ell_\theta(X)\dot\ell_\theta(X)^\top\big] = -E_\theta\big[\ddot\ell_\theta(X)\big]$$
as the Fisher information.
It is possible to extend this definition to certain cases where $\theta \mapsto \log p_\theta(x)$ is not even differentiable, like the Laplace location family, but for our purposes we can just assume "sufficient regularity".
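As a standard worked instance (my own illustration): for $X \sim \mathrm{Bernoulli}(\theta)$,
$$\ell_\theta(x) = x\log\theta + (1-x)\log(1-\theta), \qquad \dot\ell_\theta(x) = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x - \theta}{\theta(1-\theta)},$$
so $I(\theta) = \mathrm{Var}_\theta\big(\dot\ell_\theta(X)\big) = \dfrac{\theta(1-\theta)}{\theta^2(1-\theta)^2} = \dfrac{1}{\theta(1-\theta)}$.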
Now we try the same differentiation with another statistic $\phi(X)$. Let $E_\theta[\phi(X)] = g(\theta)$ (i.e. $\phi$ is an unbiased estimator of $g(\theta)$); then differentiating under the integral,
$$\dot g(\theta) = \int \phi(x)\,\dot\ell_\theta(x)\, p_\theta(x)\, d\mu(x) = \mathrm{Cov}_\theta\big(\phi(X),\, \dot\ell_\theta(X)\big)$$
(since $E_\theta[\dot\ell_\theta(X)] = 0$).
Interpretation: if $g(\theta)$ is the estimand, then by the Cauchy–Schwarz inequality no unbiased estimator can have smaller variance than $\dot g(\theta)^\top I(\theta)^{-1}\, \dot g(\theta)$.
Combining these pieces, the resulting bound
$$\mathrm{Var}_\theta\big(\phi(X)\big) \ \ge\ \dot g(\theta)^\top I(\theta)^{-1}\, \dot g(\theta)$$
is called the Cramér–Rao Lower Bound (CRLB).
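Continuing the Bernoulli illustration above: for a single observation and $g(\theta) = \theta$, the CRLB is $\dot g(\theta)^2 / I(\theta) = \theta(1-\theta)$, and the estimator $\phi(X) = X$ has $\mathrm{Var}_\theta(X) = \theta(1-\theta)$, so in this case the bound is attained.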
Example (IID Sample)
$X_1, \dots, X_n \overset{\text{iid}}{\sim} p_\theta$, where $p_\theta$ is "regular" (common support, finite derivative w.r.t. $\theta$). The joint log-likelihood is a sum, so the scores add up and
$$I_n(\theta) = n\, I_1(\theta),$$
where $I_1$ is the information in a single observation.
Let $g(\theta) = \theta$. Then
$$\mathrm{Var}_\theta\big(\phi(X_1, \dots, X_n)\big) \ \ge\ \frac{1}{n\, I_1(\theta)},$$
so the lower bound scales like $1/n$.
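For instance (an illustrative case, not from the notes): for $X_i \overset{\text{iid}}{\sim} N(\theta, \sigma^2)$ with $\sigma^2$ known, $I_1(\theta) = 1/\sigma^2$, so the CRLB for estimating $\theta$ is $\sigma^2 / n$; the sample mean $\bar X$ has exactly this variance, so it attains the bound for every $n$.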
4 Efficiency
The CRLB is not necessarily attainable. We define the efficiency of an unbiased estimator $\phi(X)$ of $g(\theta)$ as
$$\mathrm{eff}_\theta(\phi) = \frac{\dot g(\theta)^\top I(\theta)^{-1}\, \dot g(\theta)}{\mathrm{Var}_\theta\big(\phi(X)\big)} \ \le\ 1.$$
If $\mathrm{eff}_\theta(\phi) = 1$, then by (2.1) equality holds in the Cauchy–Schwarz step, so $\phi(X) - g(\theta)$ must be ($P_\theta$-a.s.) a linear function of the score $\dot\ell_\theta(X)$.
We say $\phi$ is efficient if $\mathrm{eff}_\theta(\phi) = 1$ for all $\theta$.
Plugging in the CRLB's expression to identify the coefficient, $\phi$ is efficient iff
$$\phi(x) - g(\theta) = \dot g(\theta)^\top I(\theta)^{-1}\, \dot\ell_\theta(x) \qquad \text{for $\mu$-a.e. } x, \text{ for all } \theta.$$
This is rarely achieved in finite samples, but we can approach it asymptotically as $n \to \infty$.
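To illustrate the finite-sample gap, here is a small simulation sketch (the Exponential($\theta$) model and the estimator $\phi = (n-1)/\sum_i X_i$ are my own illustrative choices): $\phi$ is unbiased for the rate $\theta$ with variance $\theta^2/(n-2)$, so its efficiency is $(n-2)/n$, below $1$ for every finite $n$ but tending to $1$.

```python
import numpy as np

# Monte Carlo: variance of phi = (n-1)/sum(X) vs. the CRLB theta^2/n for
# X_1..X_n iid Exponential(theta) (rate parameterization).
rng = np.random.default_rng(0)
theta, reps = 2.0, 200_000

for n in (5, 20, 100):
    x = rng.exponential(scale=1.0 / theta, size=(reps, n))
    phi = (n - 1) / x.sum(axis=1)        # unbiased estimator of theta
    crlb = theta**2 / n                  # CRLB for g(theta) = theta
    print(f"n={n:4d}  Var(phi)~{phi.var():.4f}  CRLB={crlb:.4f}  "
          f"efficiency~{crlb / phi.var():.3f}  (theory {(n - 2) / n:.3f})")
```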
Example (Exponential Family)
$p_\eta(x) = e^{\eta^\top T(x) - A(\eta)}\, h(x)$, so $\dot\ell_\eta(x) = T(x) - \nabla A(\eta)$, so $\ddot\ell_\eta(x) = -\nabla^2 A(\eta)$. Then
$$I(\eta) = \mathrm{Var}_\eta\big(T(X)\big) = \nabla^2 A(\eta).$$
(Note that $\ddot\ell_\eta(x)$ is non-random.) So any unbiased estimator of $g(\eta) = \nabla A(\eta) = E_\eta[T(X)]$ has variance at least $\nabla^2 A(\eta)$, and $\phi(X) = T(X)$ attains this bound: $T(X) - g(\eta) = I(\eta)\, I(\eta)^{-1}\, \dot\ell_\eta(X)$, so $T(X)$ is the efficient estimator of its own mean.
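For instance, in the illustrative Poisson instance above, $T(X) = X$ is the efficient unbiased estimator of $E_\eta[X] = e^\eta = \lambda$, with variance $A''(\eta) = \lambda$ attaining the bound.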
Example (Curved Family)
For the curved family $p_{\eta(\theta)}$, $\ell_\theta(x) = \eta(\theta)^\top T(x) - A(\eta(\theta)) + \log h(x)$, so $\dot\ell_\theta(x) = \dot\eta(\theta)^\top\big(T(x) - \nabla A(\eta(\theta))\big)$. Then
$$I(\theta) = \dot\eta(\theta)^\top\, \nabla^2 A\big(\eta(\theta)\big)\, \dot\eta(\theta),$$
and the score is a centered version of $\dot\eta(\theta)^\top T(X)$, so $\dot\eta(\theta)^\top T(X)$ is the "locally complete sufficient statistic" (it is complete sufficient for the tangent family at $\theta$).
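Continuing the illustrative $N(\theta, \theta^2)$ family from above: $\dot\eta(\theta)^\top T(x) = -\dfrac{x}{\theta^2} + \dfrac{x^2}{\theta^3}$, and a direct computation gives
$$\dot\ell_\theta(x) = -\frac{1}{\theta} - \frac{x}{\theta^2} + \frac{x^2}{\theta^3}, \qquad I(\theta) = \frac{3}{\theta^2},$$
so the score is indeed the centered version of $\dot\eta(\theta)^\top T(X)$.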
5 Fisher Information as Local Metric
Recall that the KL divergence $D_{\mathrm{KL}}(P \,\|\, Q) = E_P\big[\log \frac{dP}{dQ}\big]$ is used to describe the "distance" between two distributions.
For a parametric model, consider $\theta \mapsto D_{\mathrm{KL}}(P_{\theta_0} \,\|\, P_\theta) = E_{\theta_0}\big[\ell_{\theta_0}(X) - \ell_\theta(X)\big]$. Since $\theta \mapsto E_{\theta_0}[\ell_\theta(X)]$ is maximized at $\theta = \theta_0$ (equivalently, the KL divergence is minimized there, with value $0$), the gradient in $\theta$ vanishes at $\theta_0$, and a second-order Taylor expansion gives
$$D_{\mathrm{KL}}(P_{\theta_0} \,\|\, P_\theta) \approx \tfrac12\, (\theta - \theta_0)^\top I(\theta_0)\, (\theta - \theta_0),$$
so the Fisher information acts as a local metric on the parameter space.
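A small numerical sketch (my own illustration) comparing the exact KL divergence to the quadratic approximation $\tfrac12 I(\theta_0)(\theta - \theta_0)^2$ in the Bernoulli model, where $I(\theta_0) = 1/(\theta_0(1-\theta_0))$:

```python
import numpy as np

# Compare exact KL(theta0 || theta) with the local quadratic approximation
# 0.5 * I(theta0) * (theta - theta0)^2 for the Bernoulli model.
def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0 = 0.3
fisher = 1.0 / (theta0 * (1 - theta0))   # I(theta0) for Bernoulli
for theta in (0.31, 0.35, 0.4, 0.5):
    exact = kl_bernoulli(theta0, theta)
    approx = 0.5 * fisher * (theta - theta0) ** 2
    print(f"theta={theta:.2f}  KL={exact:.6f}  quadratic={approx:.6f}")
```

The approximation is very accurate for $\theta$ near $\theta_0$ and degrades as $\theta$ moves away, which is exactly the "local metric" statement.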