2.1 Exponential Families

#ExponentialFamily #SufficientStatistic #DifferentialIdentity #NaturalParameter #MGF #CGF #Logit

1 Definition

We have discussed statistical models. We now study a family of models with a specific structure, which gives them many useful properties.

Exponential Family

$P = \{P_{η} \mid η \in Ξ\}$ is an $s$-parameter exponential family if it is defined by a family of densities of the form $\begin{matrix} (1.1) & p_{η} (x) = e^{η^{T} T (x) - A (η)} h (x), \end{matrix}$ w.r.t. a common dominating measure $μ$ ( $P_{η} ≪ μ, \forall η \in Ξ$ ).
The parts of the formula have distinct names:

$T : X \to R^{s}$ : sufficient statistic,
$h : X \to [0, \infty)$ : carrier density/base density,
$η \in Ξ \subset R^{s}$ : natural parameter,
$A : Ξ \to R$ : log-partition function.

By definition, integrating both sides of (1.1) w.r.t. $x$ over $X$ gives $\begin{aligned} 1 & = \int_{X} e^{η^{T} T (x) - A (η)} h (x) d μ (x) \\ \Rightarrow A (η) & = \log \left(\int_{X} e^{η^{T} T (x)} h (x) d μ (x)\right) \leq \infty . \quad (1.2) \end{aligned}$
We must restrict to $A (η) < \infty$ , so define the natural parameter space $Ξ_{1} = \{η \mid A (η) < \infty\} \subset R^{s} .$
One can prove that $A (η)$ is a convex function, so $Ξ_{1}$ is a convex set.

Example

Poisson distribution: recall that $X \sim Poisson (λ)$ has density $p_{λ} (x) = \frac{λ^{x} e^{- λ}}{x!} = \exp \{(\log λ) x - λ\} \frac{1}{x!}, x \in N .$ So take $η = \log λ$ , and $p_{η}$ is an exponential family with sufficient statistic $T (x) = x$ , base density $h (x) = \frac{1}{x!}$ and log-partition function $A (η) = e^{η}$ .
The representation is not unique. For example, we could take $T (x) = \frac{x}{2}, η = 2 \log λ, A (η) = e^{\frac{η}{2}}$ ; or $T (x) = x + 1, η = \log λ, A (η) = e^{η} + η$ .
Poisson sample of size $n$ :
if $X_{1}, \dots, X_{n} \overset{i . i . d}{\sim} Poisson (λ)$ , the joint density is $p_{η} (x) = \exp \{(\log λ) \sum_{i = 1}^{n} x_{i} - n λ\} \frac{1}{\prod_{i = 1}^{n} x_{i}!}$ . So $η = \log λ, T (x) = \sum_{i = 1}^{n} x_{i}, A (η) = n e^{η}, h (x) = \prod_{i = 1}^{n} h^{(1)} (x_{i})$ , where $h^{(1)} (x_{i}) = \frac{1}{x_{i}!}$ is the single-observation base density.
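
As a quick numerical sanity check of the single-observation Poisson example, the sketch below compares the textbook pmf with the exponential-family form and recomputes $A (η)$ from (1.2) as a truncated sum. The rate `lam = 3.5`, the truncation points, and all function names are illustrative assumptions, not part of the notes.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf: lam^x e^{-lam} / x!."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, eta):
    """Same pmf written as exp(eta*T(x) - A(eta)) * h(x)."""
    T = x                          # sufficient statistic T(x) = x
    A = math.exp(eta)              # log-partition function A(eta) = e^eta
    h = 1.0 / math.factorial(x)    # base density h(x) = 1/x!
    return math.exp(eta * T - A) * h

lam = 3.5                          # arbitrary illustrative rate (assumption)
eta = math.log(lam)                # natural parameter eta = log(lam)

# Largest pointwise disagreement between the two forms on x = 0..39
max_gap = max(abs(poisson_pmf(x, lam) - expfam_pmf(x, eta)) for x in range(40))

# A(eta) from (1.2), with the sum over the support truncated at 100 terms
A_numeric = math.log(sum(math.exp(eta * x) / math.factorial(x) for x in range(100)))

print(max_gap, A_numeric, math.exp(eta))
```

Both forms should agree up to floating-point rounding, and `A_numeric` should be close to $e^{η} = λ$.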

1.1 Distribution of $T (X)$

If $X \sim p_{η} (x) = e^{η^{T} T (x) - A (η)}$ w.r.t. $μ$ (WLOG we let $h \equiv 1$ ; otherwise we let $μ$ absorb $h$ ), then $T (X) \sim q_{η} (t) = e^{η^{T} t - A (η)}$ w.r.t. $ν$ , where $ν$ is the push-forward of the measure $μ$ through $T : X \to R^{s}$ : $ν (B) \overset{Δ}{=} μ (\{x : T (x) \in B\}) .$ Indeed, $\begin{aligned} P_{η} (T (X) \in B) & = \int 1_{B} (T (x)) e^{η^{T} T (x) - A (η)} d μ (x) \\ & = \int 1_{B} (t) e^{η^{T} t - A (η)} d ν (t) . \end{aligned}$
This is simplest in the discrete case (we can now drop the $h \equiv 1$ assumption):

$\begin{aligned} P_{η} (T (X) = t) & = \sum_{x : T (x) = t} e^{η^{T} T (x) - A (η)} h (x) μ (\{x\}) \\ & = e^{η^{T} t - A (η)} \sum_{x : T (x) = t} h (x) μ (\{x\}), \end{aligned}$

and we denote $ν (\{t\}) = \sum_{x : T (x) = t} h (x) μ (\{x\}) .$
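
To make the discrete formula concrete, here is a small sketch for the $n$-sample Poisson family, assuming $μ$ is counting measure on $N^{n}$, $T (x) = \sum_{i} x_{i}$ and $h (x) = \prod_{i} \frac{1}{x_{i}!}$ . It computes $ν (\{t\})$ by brute-force enumeration and compares it with $\frac{n^{t}}{t!}$ , the value the multinomial theorem gives. The function name `nu_mass` and the choice `n = 3` are illustrative.

```python
import math
from itertools import product

def nu_mass(t, n):
    """nu({t}): sum of h(x) over x in N^n with x_1 + ... + x_n = t."""
    total = 0.0
    # Each coordinate is at most t, so enumerating {0..t}^n covers every solution.
    for x in product(range(t + 1), repeat=n):
        if sum(x) == t:
            total += 1.0 / math.prod(math.factorial(xi) for xi in x)
    return total

n = 3                                               # illustrative sample size
masses = [nu_mass(t, n) for t in range(6)]          # brute force
closed_form = [n**t / math.factorial(t) for t in range(6)]

for t, (m, c) in enumerate(zip(masses, closed_form)):
    print(t, m, c)
```

With this $ν$ , the density of $T (X)$ is $e^{η t - n e^{η}} \frac{n^{t}}{t!}$ , which is the $Poisson (n λ)$ pmf when $η = \log λ$ .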

1.2 Canonical Form

Based on the discussion above, we can simplify the structure of an exponential family:

$T (x) \to x$ (by sufficiency reduction discussed above)
$h (x) \to 1$ (by absorbing $h$ into $μ$ )
$θ = η$ (by parameterizing by $η$ )

Based on these, we define

Canonical Form

$p_{η} (x) = e^{η^{T} x - A (η)}$ is called the canonical form.

2 Differential Identities

By (1.2) we have $\begin{matrix} (2.1) & e^{A (η)} = \int_{X} e^{η^{T} T (x)} h (x) d μ (x) . \end{matrix}$ We can differentiate this identity to get meaningful results. We use without proof that differentiation and integration can be interchanged on the interior of $Ξ_{1}$ .

2.1 Mean of $T (X)$

Let $T_{j} (x)$ denote the $j$ th coordinate of $T (x)$ . Then $\begin{aligned} \frac{\partial}{\partial η_{j}} e^{A (η)} & = \int_{X} \frac{\partial}{\partial η_{j}} e^{η^{T} T (x)} h (x) d μ (x) \\ \Rightarrow e^{A (η)} \frac{\partial A}{\partial η_{j}} (η) & = \int_{X} T_{j} (x) e^{η^{T} T (x)} h (x) d μ (x) \\ \Rightarrow \frac{\partial A}{\partial η_{j}} (η) & = \int_{X} T_{j} (x) e^{η^{T} T (x) - A (η)} h (x) d μ (x) = E_{η} [T_{j} (X)] . \end{aligned}$ Collecting these for $j = 1, \dots, s$ : $\begin{matrix} (2.2) & \nabla A (η) = E_{η} [T (X)] . \end{matrix}$

2.2 Variance of $T (X)$

Take a second partial derivative: $\begin{aligned} \frac{\partial^{2}}{\partial η_{j} \partial η_{k}} e^{A (η)} & = \int_{X} \frac{\partial^{2}}{\partial η_{j} \partial η_{k}} e^{η^{T} T (x)} h (x) d μ (x) \\ \Rightarrow e^{A (η)} \left(\frac{\partial^{2} A}{\partial η_{j} \partial η_{k}} + \frac{\partial A}{\partial η_{j}} \frac{\partial A}{\partial η_{k}}\right) & = \int_{X} T_{j} (x) T_{k} (x) e^{η^{T} T (x)} h (x) d μ (x) \\ \Rightarrow \frac{\partial^{2} A}{\partial η_{j} \partial η_{k}} + E_{η} [T_{j} (X)] E_{η} [T_{k} (X)] & = E_{η} [T_{j} (X) T_{k} (X)] \\ \Rightarrow \frac{\partial^{2} A}{\partial η_{j} \partial η_{k}} & = {Cov}_{η} (T_{j} (X), T_{k} (X)) . \end{aligned}$ The third line divides by $e^{A (η)}$ and uses (2.2). Finally we get $\begin{matrix} (2.3) & \nabla^{2} A (η) = {Var}_{η} (T (X)) . \end{matrix}$ Here ${Var}_{η} (T (X))$ is the $s \times s$ covariance matrix of the random vector $T (X)$ .

Example

As we have shown above, in the Poisson exponential family $T (X) = X$ , $η = \log λ$ , $A (η) = e^{η} = λ$ . So $\begin{aligned} E_{η} (X) & = \frac{d}{d η} e^{η} = e^{η} = λ, \\ {Var}_{η} (X) & = \frac{d^{2}}{d η^{2}} e^{η} = e^{η} = λ . \end{aligned}$
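
The identities (2.2) and (2.3) can also be checked numerically without using the closed form $A (η) = e^{η}$ . The sketch below computes $A$ from (1.2) as a truncated sum, differentiates it by central finite differences, and compares the results with the mean and variance computed directly from the pmf. The step size, truncation point, and the choice $λ = 2$ are illustrative assumptions.

```python
import math

def log_partition(eta, terms=120):
    """A(eta) = log sum_x e^{eta x} / x!, from (1.2), truncated at `terms`."""
    return math.log(sum(math.exp(eta * x) / math.factorial(x) for x in range(terms)))

def moments(eta, terms=120):
    """Mean and variance of X computed directly from the pmf."""
    A = log_partition(eta, terms)
    pmf = [math.exp(eta * x - A) / math.factorial(x) for x in range(terms)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var

eta = math.log(2.0)      # corresponds to lambda = 2 (illustrative)
step = 1e-3              # finite-difference step (assumption)

# Central differences approximating A'(eta) and A''(eta)
grad = (log_partition(eta + step) - log_partition(eta - step)) / (2 * step)
hess = (log_partition(eta + step) - 2 * log_partition(eta)
        + log_partition(eta - step)) / step**2

mean, var = moments(eta)
print(grad, mean)
print(hess, var)
```

Up to finite-difference error, `grad` should match `mean` and `hess` should match `var`, both close to $λ$ .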

2.3 MGF of $T (X)$

The Moment Generating Function (MGF) of a $d$ -dimensional random vector $X \sim P$ is defined as $M_{X} (u) = E [e^{u^{T} X}], u \in R^{d}$ . We can calculate moments by differentiating the MGF, as long as $M_{X} (u)$ is finite in a neighborhood of $0$ . For the first moments, $\begin{array}{r} \frac{\partial}{\partial u_{j}} M_{X} (u) = \int_{X} \frac{\partial}{\partial u_{j}} e^{u^{T} x} d P (x) = \int_{X} x_{j} e^{u^{T} x} d P (x) . \end{array}$
Setting $u = 0$ , we obtain $\frac{\partial}{\partial u_{j}} M_{X} (0) = \int_{X} x_{j} d P (x) = E [X_{j}] .$ Similarly, $\begin{aligned} \left. \frac{\partial^{m_{1} + \dots + m_{d}}}{\partial u_{1}^{m_{1}} \dots \partial u_{d}^{m_{d}}} M_{X} (u) \right|_{u = 0} & = \left. \int_{X} x_{1}^{m_{1}} \dots x_{d}^{m_{d}} e^{u^{T} x} d P (x) \right|_{u = 0} \\ & = E [X_{1}^{m_{1}} \dots X_{d}^{m_{d}}] . \quad (2.4) \end{aligned}$
On the other hand, given $η$ and $P_{η}$ , we can explicitly calculate the MGF of $T (X)$ for an exponential family, for any $u$ with $η + u \in Ξ_{1}$ : $\begin{aligned} M_{T (X)} (u) & = E_{η} [e^{u^{T} T (X)}] = \int_{X} e^{u^{T} T (x)} e^{η^{T} T (x) - A (η)} h (x) d μ (x) \\ & = e^{- A (η)} \int_{X} e^{(u + η)^{T} T (x)} h (x) d μ (x) = e^{A (η + u) - A (η)} , \end{aligned}$ where the last step applies (2.1) at $η + u$ .

Example

Plugging the Poisson distribution into this formula, with $η = \log λ$ , we have $M_{X} (u) = \exp \{e^{η + u} - e^{η}\} = \exp \{λ (e^{u} - 1)\} .$
To see why this is useful, suppose $X_{i} \sim Poisson (λ_{i})$ independently for $i = 1, \dots, n$ , and we want to determine the distribution of $X_{+} = \sum_{i = 1}^{n} X_{i}$ . Then $M_{X_{+}} (u) = \prod_{i = 1}^{n} M_{X_{i}} (u) = \exp \{\sum_{i = 1}^{n} λ_{i} (e^{u} - 1)\} .$
This is the MGF of a Poisson distribution with mean $\sum_{i = 1}^{n} λ_{i}$ , and an MGF that is finite in a neighborhood of $0$ determines the distribution, so $X_{+} \sim Poisson (\sum_{i = 1}^{n} λ_{i})$ .
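
The two claims in this example can be checked numerically. The sketch below compares a truncated-sum evaluation of $E [e^{u X}]$ with the closed form $e^{A (η + u) - A (η)}$ , and compares the pmf of a sum of two independent Poisson variables (computed by direct convolution) with the Poisson pmf at the summed rate. The rates `lam1`, `lam2`, the point `u`, and the truncation points are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam**x / math.factorial(x)

def mgf_numeric(u, lam, terms=150):
    """E[e^{uX}] as a truncated sum over the support."""
    return sum(math.exp(u * x) * poisson_pmf(x, lam) for x in range(terms))

def mgf_closed(u, lam):
    """exp(A(eta + u) - A(eta)) with A(eta) = e^eta and eta = log(lam)."""
    eta = math.log(lam)
    return math.exp(math.exp(eta + u) - math.exp(eta))

lam1, lam2, u = 1.5, 2.5, 0.3     # illustrative values

def sum_pmf(k):
    """P(X1 + X2 = k) by convolving the two pmfs."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))

mgf_gap = abs(mgf_numeric(u, lam1) - mgf_closed(u, lam1))
conv_gap = max(abs(sum_pmf(k) - poisson_pmf(k, lam1 + lam2)) for k in range(30))

print(mgf_gap, conv_gap)
```

Both gaps should be at the level of floating-point rounding.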

2.4 CGF

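The cumulant-generating function (CGF) is the log of the MGF: $K_{X} (u) = \log M_{X} (u)$ . So for an exponential family, $\begin{aligned} K_{T (X)} (u) & = A (η + u) - A (η) \\ \Rightarrow \left. \frac{\partial}{\partial u_{j}} K_{T (X)} (u) \right|_{u = 0} & = \frac{\partial A}{\partial η_{j}} (η) = E_{η} [T_{j} (X)] , \end{aligned}$ which recovers (2.2). Second derivatives in $u$ at $u = 0$ likewise recover (2.3).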

3 Other Parameterizations

Instead of parameterizing $P$ by $η$ , we can parameterize the family by another parameter $θ$ through a map $η = η (θ)$ , so $p_{θ} (x) = e^{η (θ)^{T} T (x) - B (θ)} h (x), B (θ) = A (η (θ)) .$

Example

Poisson distribution: indexing the family by the mean $λ$ , we have $η (λ) = \log λ$ and $B (λ) = λ$ .
Normal: $X \sim N (μ, σ^{2})$ . With the usual parameter vector $θ = (μ, σ^{2})$ , $\begin{aligned} p_{θ} (x) & = \frac{1}{\sqrt{2 π σ^{2}}} \exp \left\{- \frac{(x - μ)^{2}}{2 σ^{2}}\right\} \\ & = \exp \left\{\frac{μ}{σ^{2}} x - \frac{1}{2 σ^{2}} x^{2} - \frac{μ^{2}}{2 σ^{2}} - \frac{1}{2} \log (2 π σ^{2})\right\}, \end{aligned}$
so it is an exponential family with $T (x) = (x, x^{2})$ , $η (θ) = (\frac{μ}{σ^{2}}, - \frac{1}{2 σ^{2}})$ , $h (x) = 1$ , and $B (θ) = \frac{μ^{2}}{2 σ^{2}} + \frac{1}{2} \log (2 π σ^{2})$ .
Or we can set $η_{1} = \frac{μ}{σ^{2}}, η_{2} = - \frac{1}{2 σ^{2}}$ to obtain the natural parameterization $p_{η} (x) = e^{η^{T} T (x) - A (η)}, A (η) = - \frac{η_{1}^{2}}{4 η_{2}} + \frac{1}{2} \log (- \frac{π}{η_{2}}) ,$ defined for $η_{2} < 0$ .
Binomial: $X \sim Binomial (n, θ)$ . So $\begin{aligned} p_{θ} (x) & = θ^{x} (1 - θ)^{n - x} \binom{n}{x} \\ & = \exp \{x \log θ + (n - x) \log (1 - θ)\} \binom{n}{x} \\ & = \exp \left\{x \log \left(\frac{θ}{1 - θ}\right) + n \log (1 - θ)\right\} \binom{n}{x}, \end{aligned}$ so we can take $T (x) = x, η = \log (\frac{θ}{1 - θ}), B (θ) = - n \log (1 - θ), h (x) = \binom{n}{x}$ .

$η = \log (\frac{θ}{1 - θ})$ here is called the logit, or log-odds, of $θ$ .

Beta: $X \sim Beta (α, β)$ , then for $x \in (0, 1)$ , $\begin{aligned} p_{α, β} (x) & = \frac{x^{α - 1} (1 - x)^{β - 1}}{B (α, β)} \\ & = \exp \{α \log x + β \log (1 - x) - \log B (α, β)\} \cdot \frac{1}{x (1 - x)}, \end{aligned}$ where $B (α, β) = \int_{0}^{1} t^{α - 1} (1 - t)^{β - 1} d t$ is the beta function. So we can take $T (x) = (\log x, \log (1 - x)), η = (α, β), h (x) = \frac{1}{x (1 - x)}$ , with log-partition function $A (η) = \log B (α, β)$ .
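
The Normal and Binomial reparameterizations can be spot-checked numerically. The sketch below assumes the natural parameters $η = (\frac{μ}{σ^{2}}, - \frac{1}{2 σ^{2}})$ for the Normal family and the logit for the Binomial family, for which $A (η) = n \log (1 + e^{η}) = - n \log (1 - θ)$ . All function names and numeric values are illustrative.

```python
import math

def normal_natural(mu, sigma2):
    """eta(theta) for theta = (mu, sigma^2), with T(x) = (x, x^2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def normal_A(eta1, eta2):
    """Log-partition function in natural coordinates (requires eta2 < 0)."""
    return -eta1**2 / (4.0 * eta2) + 0.5 * math.log(-math.pi / eta2)

def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2, x = 1.2, 0.7, 0.4                     # illustrative values
eta1, eta2 = normal_natural(mu, sigma2)
pdf_expfam = math.exp(eta1 * x + eta2 * x**2 - normal_A(eta1, eta2))
normal_gap = abs(pdf_expfam - normal_pdf(x, mu, sigma2))

def binomial_pmf(k, n, theta):
    return math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)

n, theta = 10, 0.3                                # illustrative values
logit = math.log(theta / (1.0 - theta))           # natural parameter
A_binom = n * math.log(1.0 + math.exp(logit))     # equals -n * log(1 - theta)
binom_gap = max(
    abs(math.exp(logit * k - A_binom) * math.comb(n, k) - binomial_pmf(k, n, theta))
    for k in range(n + 1)
)

print(normal_gap, binom_gap)
```

Both gaps should be at rounding level, confirming that the exponential-family forms reproduce the usual densities.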

4 Interpretation: Exponential Tilting

We can think of $p_{η} (x) = e^{η^{T} T (x) - A (η)} h (x)$ as an exponential tilt of the carrier $h (x)$ :

Start with the carrier $h (x)$ .
Multiply by $e^{η^{T} T (x)}$ .
Re-normalize by $e^{- A (η)}$ .

The coordinates $T (X) = (T_{1} (X), \dots, T_{s} (X))$ can be viewed as spanning a linear space of directions in which we can tilt $h (x)$ . $Ξ_{1}$ is the set of all tilts after which normalization is still possible (the normalizing integral stays finite).
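
The three tilting steps can be sketched on a finite support, where normalization is always possible. The flat carrier on $\{0, \dots, 5\}$ , the statistic $T (x) = x$ , and the helper name `tilt` are hypothetical choices made only for illustration.

```python
import math

def tilt(carrier, T, eta):
    """Tilt a carrier on a finite support; returns (tilted pmf, A(eta))."""
    # Steps 1 and 2: start with h(x) and multiply by e^{eta * T(x)}
    weights = {x: h * math.exp(eta * T(x)) for x, h in carrier.items()}
    # Step 3: re-normalize by e^{-A(eta)}, where A(eta) is the log of the total mass
    A = math.log(sum(weights.values()))
    return {x: w * math.exp(-A) for x, w in weights.items()}, A

def identity(x):
    return x

carrier = {x: 1.0 for x in range(6)}   # hypothetical flat carrier on {0, ..., 5}

flat, A_flat = tilt(carrier, identity, 0.0)   # eta = 0 only normalizes
up, A_up = tilt(carrier, identity, 0.8)       # eta > 0 shifts mass toward large x

mean_flat = sum(x * p for x, p in flat.items())
mean_up = sum(x * p for x, p in up.items())
print(mean_flat, mean_up)
```

A positive $η$ moves mass toward larger values of $T (x)$ , so `mean_up` exceeds `mean_flat`; a negative $η$ would tilt the other way.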

5 Repeated Sampling from Exponential Families

One of the most important properties of exponential families is that a large sample can be summarized by a low-dimensional statistic.
Suppose $X = (X_{1}, \dots, X_{n})$ is an i.i.d. sample from an exponential family, $X_{1}, \dots, X_{n} \overset{i . i . d}{\sim} p_{η}^{(1)} (x) = e^{η^{T} T (x) - A (η)} h (x),$ then the joint density is $p_{η} (x) = \prod_{i = 1}^{n} e^{η^{T} T (x_{i}) - A (η)} h (x_{i}) = \exp \{η^{T} \sum_{i = 1}^{n} T (x_{i}) - n A (η)\} \prod_{i = 1}^{n} h (x_{i}) .$ This is again an exponential family, with sufficient statistic $\sum_{i = 1}^{n} T (X_{i})$ , base density $\prod_{i = 1}^{n} h (x_{i})$ and log-partition function $n A (η)$ . The dimension of the sufficient statistic stays $s$ no matter how large $n$ is.
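
One way to see the summarization property numerically: for two samples with the same value of $\sum_{i} T (x_{i})$ , the log joint densities differ only through the base density, so their difference does not depend on $η$ . The sketch below illustrates this for the Poisson family; the two samples and the grid of $η$ values are hypothetical.

```python
import math

def log_joint(sample, eta):
    """Poisson log joint density: eta * sum(x_i) - n * e^eta + sum(log h(x_i))."""
    n = len(sample)
    log_h = -sum(math.lgamma(x + 1) for x in sample)   # log(1/x!) = -lgamma(x + 1)
    return eta * sum(sample) - n * math.exp(eta) + log_h

sample_a = [0, 4, 2, 6]    # two hypothetical samples with the same sum (12)
sample_b = [3, 3, 3, 3]

etas = [-1.0, 0.0, 0.5, 1.3]
diffs = [log_joint(sample_a, e) - log_joint(sample_b, e) for e in etas]

# The spread of the differences across eta should be zero up to rounding
spread = max(diffs) - min(diffs)
print(diffs, spread)
```

Because the difference is constant in $η$ , the two samples carry the same information about $η$ ; only $\sum_{i} x_{i}$ matters for inference.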
