3.2 Score, Fisher Information

1 Score Function

As a motivation, we consider the exponential family
$$p(x;\theta) = e^{\eta(\theta)^T T(x) - A(\eta(\theta))}\, h(x), \qquad \eta: \mathbb{R} \to \mathbb{R}^2, \quad \Xi = \{\eta(\theta) \mid \theta \in \mathbb{R}\}.$$
For this family, if $\eta(\theta)$ is non-linear, then $T(X)$ is minimal. The tangent vector at $\theta_0$ is $\dot\eta(\theta_0) = \frac{d\eta}{d\theta}(\theta_0)$. So fixing $\theta_0$, we define the tangent family $\tilde\Xi = \{\eta(\theta_0) + \varepsilon\,\dot\eta(\theta_0) \mid \varepsilon \in \mathbb{R}\}$, whose density becomes
$$q_\varepsilon(x) = e^{(\eta(\theta_0) + \varepsilon\dot\eta(\theta_0))^T T(x) - A(\eta(\theta_0) + \varepsilon\dot\eta(\theta_0))}\, h(x) = e^{\varepsilon\,\dot\eta(\theta_0)^T (T(x) - E_{\theta_0} T(x)) - B(\varepsilon)}\, k(x).$$
Here, denoting $S_{\theta_0}(x) = \dot\eta(\theta_0)^T (T(x) - E_{\theta_0} T(x))$, we see that $S_{\theta_0}$ is complete sufficient for the tangent family at $\theta_0$. We derive the score function from here.

Assume $\mathcal{P}$ has densities $p_\theta$ w.r.t. $\mu$, with $\Theta \subseteq \mathbb{R}^d$. The common support $\{x \mid p_\theta(x) > 0\}$ is the same for all $\theta$. Recall $l(\theta; x) = \log p_\theta(x)$.

Score Function

The score function is $\nabla_\theta\, l(\theta; x)$.

It plays a key role in many areas of statistics, especially in asymptotics.

We can think of the score as a "local complete sufficient statistic", i.e.
$$p_{\theta_0 + \eta}(x) = e^{l(\theta_0 + \eta;\, x) - l(\theta_0;\, x)}\, p_{\theta_0}(x) \approx e^{\eta^T \nabla l(\theta_0;\, x)}\, p_{\theta_0}(x), \qquad \eta \to 0.$$
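A minimal numeric sketch of this local approximation, using the $N(\theta, 1)$ location family (an assumed example, not from the notes): there the score is $l'(\theta; x) = x - \theta$, and for small $\eta$ the log density ratio is approximately $\eta$ times the score.

```python
import math

def log_density(theta, x):
    # log of the N(theta, 1) density
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

theta0, x, eta = 1.0, 2.5, 1e-3
score = x - theta0                                    # l'(theta0; x) for N(theta, 1)
exact = log_density(theta0 + eta, x) - log_density(theta0, x)
approx = eta * score                                  # first-order (score) term
print(exact, approx)                                  # agree up to O(eta^2)
```

The gap between the two quantities is of order $\eta^2$, consistent with the first-order expansion above.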

2 Differential Identities and Fisher Information

We assume, as usual, enough regularity so that differentiation and integration are exchangeable.
Since $1 = \int e^{l(\theta; x)}\, d\mu(x)$, we take the partial derivative w.r.t. $\theta_j$:
$$0 = \int \frac{\partial l}{\partial \theta_j}(\theta)\, e^{l(\theta)}\, d\mu = E_\theta\!\left[\frac{\partial l}{\partial \theta_j}(\theta; X)\right], \tag{1.1}$$
i.e. $E_\theta[\nabla l(\theta; X)] = 0$ (note this only holds when the expectation is taken at the same $\theta$). Then take the partial derivative w.r.t. $\theta_k$:
$$0 = \int \left(\frac{\partial^2 l}{\partial\theta_j \partial\theta_k}(\theta) + \frac{\partial l}{\partial\theta_j}\frac{\partial l}{\partial\theta_k}(\theta)\right) e^{l(\theta)}\, d\mu = E_\theta\!\left[\frac{\partial^2 l}{\partial\theta_j \partial\theta_k}\right] + E_\theta\!\left[\frac{\partial l}{\partial\theta_j}\frac{\partial l}{\partial\theta_k}\right].$$
By (1.1), the second term here is the covariance of the score components, which leads to
$$\operatorname{Var}_\theta[\nabla l(\theta; X)] = -E_\theta[\nabla^2 l(\theta; X)] = J(\theta). \tag{1.2}$$

Fisher Information

Define $J(\theta) = \operatorname{Var}_\theta[\nabla l(\theta; X)] = -E_\theta[\nabla^2 l(\theta; X)]$ as the Fisher information.

It is possible to extend this definition to certain cases where $l$ is not even differentiable, such as the Laplace location family, but for our purposes we can just assume "sufficient regularity".
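The identity (1.2) can be checked numerically. As an assumed example (not from the notes), take the Poisson($\theta$) family, where $l(\theta; x) = x\log\theta - \theta - \log x!$, so the score is $x/\theta - 1$ and the second derivative is $-x/\theta^2$; both sides of (1.2) equal $1/\theta$. Summing over the effectively finite support:

```python
import math

theta = 3.0

def pmf(x):
    # Poisson(theta) probability mass function
    return math.exp(-theta) * theta ** x / math.factorial(x)

xs = range(0, 60)                         # tail mass beyond 60 is negligible
score = lambda x: x / theta - 1.0         # d/dtheta of x*log(theta) - theta - log(x!)
hess = lambda x: -x / theta ** 2          # second derivative of the log-likelihood

mean_score = sum(score(x) * pmf(x) for x in xs)           # should be ~0, as in (1.1)
var_score = sum(score(x) ** 2 * pmf(x) for x in xs)       # Var of score (mean is 0)
neg_mean_hess = -sum(hess(x) * pmf(x) for x in xs)        # -E[l''], as in (1.2)
print(mean_score, var_score, neg_mean_hess)               # ~0, ~1/theta, ~1/theta
```

Both expressions for $J(\theta)$ agree, and the score integrates to zero, matching (1.1) and (1.2).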

Now we try with another statistic $\delta(X)$. Let $g(\theta) = E_\theta[\delta(X)] = \int \delta\, e^{l(\theta)}\, d\mu$ (i.e. $\delta$ is an unbiased estimator of $g(\theta)$); then
$$\nabla g(\theta) = \int \delta\, \nabla l\, e^{l}\, d\mu = E_\theta[\delta(X)\, \nabla l(\theta; X)] = \operatorname{Cov}_\theta(\delta(X),\, \nabla l(\theta; X)),$$
since $E_\theta[\nabla l] = 0$.
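This covariance identity can also be verified numerically. Continuing the assumed Poisson($\theta$) example with $\delta(X) = X$: here $g(\theta) = E_\theta[X] = \theta$, so $g'(\theta) = 1$, which should equal $\operatorname{Cov}_\theta(X, \dot l(\theta; X))$.

```python
import math

theta = 2.0
pmf = lambda x: math.exp(-theta) * theta ** x / math.factorial(x)
xs = range(0, 60)                          # effectively the full support

score = lambda x: x / theta - 1.0          # Poisson score
# Cov(delta, score) = E[delta * score], since E[score] = 0
cov = sum(x * score(x) * pmf(x) for x in xs)
print(cov)                                 # ~1 = g'(theta)
```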

3 CRLB

Now we combine these results with the Cauchy–Schwarz inequality. For any unbiased estimator $\delta(X)$ of $g(\theta)$,
$$\operatorname{Var}_\theta(\delta(X)) \ge \nabla g(\theta)^T J(\theta)^{-1}\, \nabla g(\theta). \tag{2.1}$$

Interpretation: if $g(\theta)$ is the estimand, no unbiased estimator can have variance smaller than $\nabla g(\theta)^T J(\theta)^{-1} \nabla g(\theta)$.

The resulting lower bound (2.1) is called the Cramér–Rao Lower Bound (CRLB).
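As an assumed illustration (not from the notes), the CRLB is attained by $\delta(X) = X$ for a single Poisson($\theta$) observation: $g(\theta) = \theta$, $J(\theta) = 1/\theta$, so the bound is $1/J(\theta) = \theta$, which is exactly $\operatorname{Var}_\theta(X)$.

```python
import math

theta = 2.5
pmf = lambda x: math.exp(-theta) * theta ** x / math.factorial(x)
xs = range(0, 80)                                     # effectively the full support

var_delta = sum((x - theta) ** 2 * pmf(x) for x in xs)        # Var(X)
J = sum((x / theta - 1.0) ** 2 * pmf(x) for x in xs)          # Var of the score
print(var_delta, 1.0 / J)                                     # both ~theta: bound attained
```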

4 Efficiency

The CRLB is not necessarily attainable. We define the efficiency of an unbiased estimator as
$$\operatorname{eff}_\theta(\delta) = \frac{\text{CRLB}}{\operatorname{Var}_\theta(\delta)} \le 1.$$
If $g(\theta) = \theta$, by (2.1), $\text{CRLB} = \frac{1}{J(\theta)}$, so $\operatorname{eff}_\theta(\delta) = \frac{1}{J(\theta)\operatorname{Var}_\theta(\delta)}$.
We say $\delta(X)$ is efficient if $\operatorname{eff}_\theta(\delta) = 1$ for all $\theta$.
Plugging in the CRLB's expression,
$$\operatorname{eff}_\theta(\delta(X)) = \frac{\operatorname{Cov}_\theta^2(\delta(X),\, \dot l(\theta; X))}{\operatorname{Var}_\theta(\delta(X))\,\operatorname{Var}_\theta(\dot l(\theta))} = \operatorname{Corr}^2_\theta(\delta(X),\, \dot l(\theta; X)).$$
So $\delta(X)$ is efficient iff $\operatorname{Corr}_\theta(\delta(X),\, \dot l(\theta; X)) = 1$ for all $\theta$.
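A small sketch of this equivalence with an assumed inefficient estimator (not from the notes): take two i.i.d. $N(\theta, 1)$ observations and $\delta = w_1 X_1 + w_2 X_2$ with $w_1 + w_2 = 1$, $w_1 \ne w_2$. All moments are available in closed form, so the efficiency ratio and the squared correlation with the score can be compared directly.

```python
# Two i.i.d. N(theta, 1) observations; delta = w1*X1 + w2*X2 is unbiased for theta.
w1, w2 = 0.7, 0.3
var_delta = w1 ** 2 + w2 ** 2        # Var(delta) under unit-variance normals
J = 2.0                              # Fisher information of two N(theta, 1) samples
crlb = 1.0 / J
eff_ratio = crlb / var_delta         # CRLB / Var(delta)

# Score of the joint sample: (X1 - theta) + (X2 - theta), with
# Cov(delta, score) = w1 + w2 = 1 and Var(score) = J = 2.
corr_sq = (w1 + w2) ** 2 / (var_delta * J)
print(eff_ratio, corr_sq)            # equal; both < 1 since w1 != w2
```

The two expressions coincide, and both fall strictly below 1 because the unequal weights make $\delta$ imperfectly correlated with the score.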

This is rarely achieved in finite samples, but we can approach it asymptotically as $n \to \infty$.

5 Fisher Information as Local Metric

Recall that the KL divergence $D_{KL}(p\,\|\,q) = E_p[\log p(X) - \log q(X)] = \int \log\!\left(\frac{p}{q}\right) p\, d\mu$ is used to describe the distance between two distributions.
For a parametric model, $D_{KL}(\theta\,\|\,\theta') = D_{KL}(p_\theta\,\|\,p_{\theta'}) = \int (l(\theta) - l(\theta'))\, e^{l(\theta)}\, d\mu$. Since
$$\frac{\partial}{\partial \theta'_j} D_{KL}(\theta\,\|\,\theta') = -\int \frac{\partial l}{\partial \theta'_j}(\theta')\, e^{l(\theta)}\, d\mu = 0 \quad \text{at } \theta' = \theta,$$
$$\frac{\partial^2}{\partial \theta'_j \partial \theta'_k} D_{KL}(\theta\,\|\,\theta') = -\int \frac{\partial^2 l}{\partial \theta'_j \partial \theta'_k}(\theta')\, e^{l(\theta)}\, d\mu = J(\theta)_{jk} \quad \text{at } \theta' = \theta,$$
and $J(\theta)$ is positive definite, the KL divergence is minimized at $\theta' = \theta$.

We can use a Taylor expansion to show that $D_{KL}(\theta\,\|\,\theta') \approx \frac{1}{2}(\theta' - \theta)^T J(\theta)\,(\theta' - \theta)$ as $\theta' \to \theta$.
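A numeric sketch of this quadratic approximation, using the assumed Poisson example again: $D_{KL}(\text{Poi}(\theta)\,\|\,\text{Poi}(\theta')) = \theta\log(\theta/\theta') + \theta' - \theta$ in closed form, and $J(\theta) = 1/\theta$, so the local approximation is $(\theta' - \theta)^2 / (2\theta)$.

```python
import math

theta, d = 2.0, 1e-3
theta_p = theta + d

# Closed-form KL divergence between Poisson(theta) and Poisson(theta')
kl = theta * math.log(theta / theta_p) + theta_p - theta
# Quadratic approximation (theta' - theta)^2 * J(theta) / 2, with J = 1/theta
quad = d ** 2 / (2 * theta)
print(kl, quad)                      # agree up to O(d^3)
```

The discrepancy is of third order in $\theta' - \theta$, confirming that the Fisher information acts as the local metric.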