1 The Essential Structure of Matrix Derivatives
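When the input or output of a function may be a scalar, a vector, or a matrix, it is easy to get confused about how its derivative is defined. The examples below review the definitions.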
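| Output \ Input | Scalar | Vector | Matrix |
| --- | --- | --- | --- |
| Scalar | $f(x)$ | $f(\vec{x})$ | $f(X)$ |
| Vector | $\vec{f}(x)$ | $\vec{f}(\vec{x})$ | $\vec{f}(X)$ |
| Matrix | $F(x)$ | $F(\vec{x})$ | $F(X)$ |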
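Let $f:\mathbb{R}^{3\times 1}\to\mathbb{R}$, $f(\vec{x}) = x_1^2 + x_1x_2 + x_2x_3$. Then
$$\frac{\partial f(\vec{x})}{\partial \vec{x}_{3\times 1}} = \begin{bmatrix}\dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \dfrac{\partial f}{\partial x_3}\end{bmatrix}^{\mathrm{T}} = \begin{bmatrix}2x_1+x_2 & x_1+x_3 & x_2\end{bmatrix}^{\mathrm{T}}.$$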
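Let $\vec{f}:\mathbb{R}^{3\times 1}\to\mathbb{R}^{2\times 1}$, $\vec{f}(\vec{x}) = (x_1+x_2,\ x_1+x_3)^{\mathrm{T}}$. Then
$$\frac{\partial \vec{f}}{\partial \vec{x}^{\mathrm{T}}} = \begin{pmatrix}\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_3}\\[2mm] \dfrac{\partial f_2}{\partial x_1} & \cdots & \dfrac{\partial f_2}{\partial x_3}\end{pmatrix}_{2\times 3} = \begin{pmatrix}1 & 1 & 0\\ 1 & 0 & 1\end{pmatrix}.$$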
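The two examples above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `numerical_jacobian` and the test point `x0` are illustrative names chosen here, not part of the original text. It approximates each partial derivative by central differences and compares the result with the closed forms.

```python
import numpy as np

def f(x):
    # Scalar example: f(x) = x1^2 + x1*x2 + x2*x3
    return x[0] ** 2 + x[0] * x[1] + x[1] * x[2]

def grad_f(x):
    # Closed-form gradient from the text
    return np.array([2 * x[0] + x[1], x[0] + x[2], x[1]])

def f_vec(x):
    # Vector example: f(x) = (x1 + x2, x1 + x3)^T
    return np.array([x[0] + x[1], x[0] + x[2]])

def numerical_jacobian(func, x, eps=1e-6):
    # Row i, column j approximates d func_i / d x_j by central differences.
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        diff = np.atleast_1d(func(x + e)) - np.atleast_1d(func(x - e))
        cols.append(diff / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([1.0, 2.0, 3.0])                 # arbitrary test point (assumption)
row_derivative = numerical_jacobian(f, x0)     # shape (1, 3), the row form
jacobian = numerical_jacobian(f_vec, x0)       # shape (2, 3)

print(grad_f(x0), row_derivative.ravel())
print(jacobian)
```

The scalar function yields a $1\times 3$ row whose transpose is the gradient, while the vector function yields the $2\times 3$ matrix, matching the shapes stated above.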
In general:
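For $f:\mathbb{R}^{n\times 1}\to\mathbb{R}$, the row-form derivative and the gradient are
$$\mathrm{D}_{\vec{x}} f(\vec{x}) = \frac{\partial f(\vec{x})}{\partial \vec{x}^{\mathrm{T}}} = \left(\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_n}\right),\qquad \nabla_{\vec{x}} f(\vec{x}) = \frac{\partial f(\vec{x})}{\partial \vec{x}}.$$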
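For $f:\mathbb{R}^{m\times n}\to\mathbb{R}$, define the column-stacking vectorization
$$\operatorname{vec}(X) = (x_{11},\cdots,x_{m1},x_{12},\cdots,x_{m2},\cdots,x_{1n},\cdots,x_{mn})^{\mathrm{T}},$$
and define
$$\mathrm{D}_{\operatorname{vec}(X)} f(X) = \frac{\partial f(X)}{\partial \operatorname{vec}^{\mathrm{T}}(X)} = \left(\frac{\partial f}{\partial x_{11}},\cdots,\frac{\partial f}{\partial x_{m1}},\cdots,\frac{\partial f}{\partial x_{1n}},\cdots,\frac{\partial f}{\partial x_{mn}}\right),$$
$$\mathrm{D}_{X} f(X) = \frac{\partial f(X)}{\partial X^{\mathrm{T}}_{m\times n}} = \begin{pmatrix}\dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{m1}}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f}{\partial x_{1n}} & \cdots & \dfrac{\partial f}{\partial x_{mn}}\end{pmatrix}_{n\times m}.$$
The gradients $\nabla_{\operatorname{vec}(X)} f(X) = \dfrac{\partial f(X)}{\partial \operatorname{vec}(X)}$ and $\nabla_{X} f(X) = \dfrac{\partial f(X)}{\partial X}$ (of size $m\times n$) are defined analogously.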
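For $F:\mathbb{R}^{m\times n}\to\mathbb{R}^{p\times q}$, define $\operatorname{vec}(F(X))$ in the same way. Then
$$\mathrm{D}_{X} F(X) = \left(\frac{\partial \operatorname{vec}_{pq\times 1}(F(X))}{\partial \operatorname{vec}^{\mathrm{T}}_{mn\times 1}(X)}\right)_{pq\times mn},\qquad \nabla_{X} F(X) = \left(\frac{\partial \operatorname{vec}^{\mathrm{T}}_{pq\times 1}(F(X))}{\partial \operatorname{vec}_{mn\times 1}(X)}\right)_{mn\times pq}.$$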
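To make the shapes concrete, here is a small sketch, assuming NumPy. It builds $\mathrm{D}_X F(X)$ column by column for the linear map $F(X) = AXB$, an example chosen here for illustration. The comparison against `np.kron(B.T, A)` relies on the standard identity $\operatorname{vec}(AXB) = (B^{\mathrm{T}}\otimes A)\operatorname{vec}(X)$, which is not stated in the original text; `vec` and `numerical_jacobian` are hypothetical helper names.

```python
import numpy as np

def vec(M):
    # Column-major stacking: (x11, ..., xm1, x12, ..., xmn)
    return M.reshape(-1, order="F")

def numerical_jacobian(F, X, eps=1e-6):
    # Approximates d vec(F(X)) / d vec(X)^T, shape (p*q, m*n).
    m, n = X.shape
    cols = []
    for j in range(n):            # walk the entries of X in column-major order
        for i in range(m):
            E = np.zeros_like(X)
            E[i, j] = eps
            cols.append((vec(F(X + E)) - vec(F(X - E))) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
m, n, p, q = 2, 3, 4, 5
A = rng.standard_normal((p, m))
B = rng.standard_normal((n, q))
X = rng.standard_normal((m, n))

D = numerical_jacobian(lambda Z: A @ Z @ B, X)   # D_X F, shape (p*q, m*n)
grad = D.T                                        # nabla_X F, shape (m*n, p*q)

print(D.shape, grad.shape)
print(np.allclose(D, np.kron(B.T, A), atol=1e-6))
```

Passing `order="F"` to `reshape` is what makes `vec` stack columns rather than rows; NumPy's default row-major order would give a different (permuted) Jacobian.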
2 Derivations Based on the Essential Structure
That is, we compute directly from the definitions.
Let $\vec{x} = (x_1,\cdots,x_n)^{\mathrm{T}}$, written simply as $x$ below. Applying the definitions above gives:
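Let $c, c_1, c_2\in\mathbb{R}$ be constants and let $f$, $g$ be scalar-valued functions of $x$. Then
$$\begin{aligned}
\frac{\partial c}{\partial x} &= 0_{n\times 1},\\
\frac{\partial [c_1 f(x) + c_2 g(x)]}{\partial x} &= c_1\frac{\partial f}{\partial x} + c_2\frac{\partial g}{\partial x},\\
\frac{\partial (f\cdot g)}{\partial x} &= \frac{\partial f}{\partial x}\,g(x) + \frac{\partial g}{\partial x}\,f(x),\\
\frac{\partial (f/g)}{\partial x} &= \frac{1}{g^2(x)}\left(\frac{\partial f}{\partial x}\,g(x) - f(x)\,\frac{\partial g}{\partial x}\right).
\end{aligned}$$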
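Let $a = (a_1,\cdots,a_n)^{\mathrm{T}}$ and $b = (b_1,\cdots,b_n)^{\mathrm{T}}$ be constant vectors and $A = (a_{ij})_{n\times n}$ a constant matrix. Then
$$\begin{aligned}
\frac{\partial (x^{\mathrm{T}}a)}{\partial x} &= \frac{\partial (a^{\mathrm{T}}x)}{\partial x} = a,\\
\frac{\partial (x^{\mathrm{T}}x)}{\partial x} &= 2x,\\
\frac{\partial (x^{\mathrm{T}}Ax)}{\partial x} &= (A + A^{\mathrm{T}})x,\\
\frac{\partial (a^{\mathrm{T}}xx^{\mathrm{T}}b)}{\partial x} &= (ab^{\mathrm{T}} + ba^{\mathrm{T}})x.
\end{aligned}$$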
To see this, note that
$$\frac{\partial (x^{\mathrm{T}}x)}{\partial x} = \frac{\partial \left(\sum_{i=1}^{n}x_i^2\right)}{\partial x} = 2(x_1,\cdots,x_n)^{\mathrm{T}} = 2x.$$
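Similarly,
$$\frac{\partial (x^{\mathrm{T}}Ax)}{\partial x} = \frac{\partial \left(\sum_{i,j}a_{ij}x_ix_j\right)}{\partial x} = \left(\sum_{j}a_{1j}x_j + \sum_{i}a_{i1}x_i,\ \cdots,\ \sum_{j}a_{nj}x_j + \sum_{i}a_{in}x_i\right)^{\mathrm{T}} = (A + A^{\mathrm{T}})x.$$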
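Finally, since $a^{\mathrm{T}}x$ and $x^{\mathrm{T}}b$ are scalars, $a^{\mathrm{T}}xx^{\mathrm{T}}b = (x^{\mathrm{T}}a)(b^{\mathrm{T}}x) = x^{\mathrm{T}}ab^{\mathrm{T}}x$. Applying the previous result with $A = ab^{\mathrm{T}}$ gives
$$\frac{\partial (a^{\mathrm{T}}xx^{\mathrm{T}}b)}{\partial x} = \frac{\partial (x^{\mathrm{T}}ab^{\mathrm{T}}x)}{\partial x} = (ab^{\mathrm{T}} + ba^{\mathrm{T}})x.$$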
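The four vector identities above can be sanity-checked with finite differences. This is a sketch under the assumption that NumPy is available; the random test data and the helper `numerical_grad` are illustrative choices, not from the original.

```python
import numpy as np

def numerical_grad(func, x, eps=1e-6):
    # Central-difference estimate of the gradient (column form) of a scalar function.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n = 4
x = rng.standard_normal(n)
a = rng.standard_normal(n)
b = rng.standard_normal(n)
A = rng.standard_normal((n, n))

# Each entry pairs a scalar function of v with its claimed gradient at x.
checks = {
    "a^T x": (lambda v: a @ v, a),
    "x^T x": (lambda v: v @ v, 2 * x),
    "x^T A x": (lambda v: v @ A @ v, (A + A.T) @ x),
    "a^T x x^T b": (lambda v: (a @ v) * (v @ b),
                    (np.outer(a, b) + np.outer(b, a)) @ x),
}
results = {name: bool(np.allclose(numerical_grad(fn, x), g, atol=1e-6))
           for name, (fn, g) in checks.items()}
print(results)
```

Because `A` is drawn without any symmetry, the check also confirms that the gradient of $x^{\mathrm{T}}Ax$ is $(A + A^{\mathrm{T}})x$ rather than $2Ax$.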
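Now let $X = (x_{ij})_{m\times n}$.
The rules for constants, linear combinations, products, and quotients are exactly the same as in the vector case.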
For constant vectors $a$, $b$ of compatible sizes:
$$\begin{align*}
\frac{\partial (a^{\mathrm{T}}X^{\mathrm{T}}b)}{\partial X} &= ba^{\mathrm{T}},\\
\frac{\partial (a^{\mathrm{T}}XX^{\mathrm{T}}b)}{\partial X} &= (ab^{\mathrm{T}}+ba^{\mathrm{T}})X,\\
\frac{\partial (a^{\mathrm{T}}X^{\mathrm{T}}Xb)}{\partial X} &= X(ba^{\mathrm{T}}+ab^{\mathrm{T}}).
\end{align*}$$
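The three matrix identities above need different vector lengths depending on which side of $X$ the vectors multiply, which is easy to get wrong. The sketch below (assuming NumPy, with illustrative names `am`, `bm`, `an`, `bn` for vectors of length $m$ and $n$) checks each formula entry by entry with central differences.

```python
import numpy as np

def numerical_grad(func, X, eps=1e-6):
    # Entry (i, j) approximates d func / d X[i, j]; the result has the shape of X.
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (func(X + E) - func(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
m, n = 3, 4
X = rng.standard_normal((m, n))
am, bm = rng.standard_normal(m), rng.standard_normal(m)
an, bn = rng.standard_normal(n), rng.standard_normal(n)

# a^T X^T b with a in R^n, b in R^m: gradient b a^T
ok1 = np.allclose(numerical_grad(lambda Z: an @ Z.T @ bm, X),
                  np.outer(bm, an), atol=1e-6)

# a^T X X^T b with a, b in R^m: gradient (a b^T + b a^T) X
ok2 = np.allclose(numerical_grad(lambda Z: am @ Z @ Z.T @ bm, X),
                  (np.outer(am, bm) + np.outer(bm, am)) @ X, atol=1e-6)

# a^T X^T X b with a, b in R^n: gradient X (b a^T + a b^T)
ok3 = np.allclose(numerical_grad(lambda Z: an @ Z.T @ Z @ bn, X),
                  X @ (np.outer(bn, an) + np.outer(an, bn)), atol=1e-6)

print(ok1, ok2, ok3)
```

In every case the gradient has the same $m\times n$ shape as $X$, consistent with the convention $\nabla_X f(X) = \partial f/\partial X$.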
3 Fast Differentiation via the Trace
First, recall two properties of the trace: $\operatorname{tr}(A^{\mathrm{T}}) = \operatorname{tr}(A)$ and $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.
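Next, define the total differential. If $f:\mathbb{R}^{m\times n}\to\mathbb{R}$ with $\dfrac{\partial f}{\partial X}\in\mathbb{R}^{m\times n}$, then
$$df = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\partial f}{\partial X_{ij}}\,dX_{ij}.$$
Since $\operatorname{tr}(A^{\mathrm{T}}B) = \sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}B_{ij}$ for $A, B\in\mathbb{R}^{m\times n}$, combining the two gives
$$df = \operatorname{tr}\left(\left(\frac{\partial f}{\partial X}\right)^{\mathrm{T}} dX\right).\tag{3.1}$$
So once $df$ has been brought into the form $\operatorname{tr}(G^{\mathrm{T}}dX)$, the derivative can be read off as $\dfrac{\partial f}{\partial X} = G$.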
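Formula (3.1) can be illustrated numerically. The sketch below assumes NumPy and uses $f(X) = \sum_{i,j} X_{ij}^3$, a test function chosen here only because its entrywise derivative $3X_{ij}^2$ is obvious; it is not an example from the original text. It compares the entrywise sum, the trace form, and the actual change in $f$ under a small perturbation `dX`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
X = rng.standard_normal((m, n))
dX = 1e-7 * rng.standard_normal((m, n))   # small perturbation standing in for dX

def f(Z):
    # Illustrative scalar function of a matrix (assumption, not from the text)
    return np.sum(Z ** 3)

G = 3 * X ** 2                            # df/dX, entry by entry

entrywise_sum = np.sum(G * dX)            # sum_ij (df/dX_ij) dX_ij
trace_form = np.trace(G.T @ dX)           # tr((df/dX)^T dX), as in (3.1)
actual_change = f(X + dX) - f(X)          # first-order change plus O(|dX|^2)

print(entrywise_sum, trace_form, actual_change)
```

The first two quantities agree up to rounding, and the third differs from them only by a second-order term in `dX`.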
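0. The rules for constants, linear combinations, products, and quotients are the same as before; in addition, $d\left(F_{p\times q}^{\mathrm{T}}(x)\right) = \left(dF_{p\times q}(x)\right)^{\mathrm{T}}$.
1. $d(AF(X)B) = A\,d(F(X))\,B$ for constant matrices $A$, $B$.
2. $d|X| = |X|\operatorname{tr}(X^{-1}dX)$; more generally, $d|F(X)| = |F(X)|\operatorname{tr}\left(F(X)^{-1}dF(X)\right)$.
3. $d(X^{-1}) = -X^{-1}\,dX\,X^{-1}$.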
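Justification of rule 2. By the cofactor expansion of the determinant along row $i$, $|X| = x_{i1}A_{i1} + \cdots + x_{in}A_{in}$ for every $i$, where the cofactor $A_{ij}$ does not depend on the entries of row $i$. Hence $\dfrac{\partial |X|}{\partial x_{ij}} = A_{ij}$, so $\dfrac{\partial |X|}{\partial X} = (A_{ij})_{n\times n}$ and $\left(\dfrac{\partial |X|}{\partial X}\right)^{\mathrm{T}} = (A_{ji})_{n\times n} = X^{*} = |X|\,X^{-1}$, where $X^{*}$ is the adjugate of $X$. Then by (3.1),
$$d|X| = \operatorname{tr}\left(\left(\frac{\partial |X|}{\partial X}\right)^{\mathrm{T}}dX\right) = \operatorname{tr}\left(|X|\,X^{-1}dX\right) = |X|\operatorname{tr}(X^{-1}dX).$$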
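Justification of rule 3. Since $I = XX^{-1}$, differentiating both sides gives $dX\,X^{-1} + X\,d(X^{-1}) = 0$, and therefore $d(X^{-1}) = -X^{-1}\,dX\,X^{-1}$.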
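The determinant and inverse rules can be checked to first order with a small perturbation. This is a minimal sketch assuming NumPy; the matrix `X` is an arbitrary strictly diagonally dominant (hence invertible) example chosen here, and `dX` is a small random perturbation standing in for the differential.

```python
import numpy as np

# Illustrative invertible matrix (strictly diagonally dominant, determinant 18)
X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
rng = np.random.default_rng(0)
dX = 1e-7 * rng.standard_normal((3, 3))

X_inv = np.linalg.inv(X)

# Rule 2: d|X| = |X| tr(X^{-1} dX)
det_change = np.linalg.det(X + dX) - np.linalg.det(X)
det_rule = np.linalg.det(X) * np.trace(X_inv @ dX)

# Rule 3: d(X^{-1}) = -X^{-1} dX X^{-1}
inv_change = np.linalg.inv(X + dX) - X_inv
inv_rule = -X_inv @ dX @ X_inv

print(det_change, det_rule)
print(np.max(np.abs(inv_change - inv_rule)))
```

Both comparisons agree up to terms of second order in `dX`, which is what a statement about differentials predicts.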
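Now let us look at some concrete examples. First note that for $f:\mathbb{R}^{m\times n}\to\mathbb{R}$, the value $f(X)$ is a scalar, so
$$\operatorname{tr}(f(X)) = f(X)\ \Rightarrow\ d\operatorname{tr}(f(X)) = df(X) = \operatorname{tr}(df(X)).\tag{3.2}$$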
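$\dfrac{\partial (a^{\mathrm{T}}XX^{\mathrm{T}}b)}{\partial X} = (ab^{\mathrm{T}} + ba^{\mathrm{T}})X$. By (3.2) and the basic properties of the trace,
$$\begin{aligned}
d(a^{\mathrm{T}}XX^{\mathrm{T}}b) &= \operatorname{tr}\left(d(a^{\mathrm{T}}XX^{\mathrm{T}}b)\right) = \operatorname{tr}\left(a^{\mathrm{T}}\left(dX\,X^{\mathrm{T}} + X\,dX^{\mathrm{T}}\right)b\right)\\
&= \operatorname{tr}(a^{\mathrm{T}}dX\,X^{\mathrm{T}}b) + \operatorname{tr}(a^{\mathrm{T}}X\,dX^{\mathrm{T}}b)\\
&= \operatorname{tr}(X^{\mathrm{T}}ba^{\mathrm{T}}dX) + \operatorname{tr}(b^{\mathrm{T}}dX\,X^{\mathrm{T}}a)\\
&= \operatorname{tr}(X^{\mathrm{T}}ba^{\mathrm{T}}dX) + \operatorname{tr}(X^{\mathrm{T}}ab^{\mathrm{T}}dX)\\
&= \operatorname{tr}\left(X^{\mathrm{T}}(ba^{\mathrm{T}} + ab^{\mathrm{T}})\,dX\right),
\end{aligned}$$
so by (3.1) the derivative is $\left(X^{\mathrm{T}}(ba^{\mathrm{T}} + ab^{\mathrm{T}})\right)^{\mathrm{T}} = (ab^{\mathrm{T}} + ba^{\mathrm{T}})X$.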
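$\dfrac{\partial \operatorname{tr}(X^{\mathrm{T}}X)}{\partial X} = 2X$, because $d\left(\operatorname{tr}(X^{\mathrm{T}}X)\right) = \operatorname{tr}\left(dX^{\mathrm{T}}X + X^{\mathrm{T}}dX\right) = \operatorname{tr}(2X^{\mathrm{T}}dX)$, where $\operatorname{tr}(dX^{\mathrm{T}}X) = \operatorname{tr}(X^{\mathrm{T}}dX)$ by transpose invariance of the trace.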
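$\dfrac{\partial \log|X|}{\partial X} = (X^{-1})^{\mathrm{T}}$ (for $|X| > 0$). Using rule 2,
$$d\log|X| = \operatorname{tr}(d\log|X|) = \operatorname{tr}\left(\frac{1}{|X|}\,d|X|\right) = \operatorname{tr}\left(\operatorname{tr}(X^{-1}dX)\right) = \operatorname{tr}(X^{-1}dX).$$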
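$\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^{\mathrm{T}}$. As above, using rules 2 and 3,
$$d|X^{-1}| = |X^{-1}|\operatorname{tr}\left((X^{-1})^{-1}d(X^{-1})\right) = |X^{-1}|\operatorname{tr}\left(X\,d(X^{-1})\right) = |X^{-1}|\operatorname{tr}\left(-XX^{-1}\,dX\,X^{-1}\right) = |X^{-1}|\operatorname{tr}\left(-X^{-1}dX\right).$$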
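$\dfrac{\partial \operatorname{tr}\left((X+A)^{-1}\right)}{\partial X} = -\left((X+A)^{-2}\right)^{\mathrm{T}}$ for a constant matrix $A$. Since $d(X+A) = dX$, rule 3 gives
$$\operatorname{tr}\left(d(X+A)^{-1}\right) = \operatorname{tr}\left(-(X+A)^{-1}\,d(X+A)\,(X+A)^{-1}\right) = \operatorname{tr}\left(-(X+A)^{-2}dX\right).$$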
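$\dfrac{\partial |X^{3}|}{\partial X} = 3|X|^{3}(X^{-1})^{\mathrm{T}}$. Since $|X^{3}| = |X|^{3}$, rule 2 gives
$$d|X^{3}| = \operatorname{tr}\left(d|X|^{3}\right) = \operatorname{tr}\left(3|X|^{2}\,d|X|\right) = \operatorname{tr}\left(3|X|^{3}\operatorname{tr}(X^{-1}dX)\right) = \operatorname{tr}\left(3|X|^{3}X^{-1}dX\right).$$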
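All of the closed forms in this list of examples can be compared against finite differences in one pass. The following is a sketch, assuming NumPy; the matrix `X`, the choice `A = I`, and the vectors `a`, `b` are arbitrary illustrative test data (with $|X| = 18 > 0$ so that $\log|X|$ is defined), and `numerical_grad` is a hypothetical helper.

```python
import numpy as np

def numerical_grad(func, X, eps=1e-6):
    # Entry (i, j) approximates d func / d X[i, j] by central differences.
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (func(X + E) - func(X - E)) / (2 * eps)
    return G

X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # invertible, determinant 18
A = np.eye(3)                         # constant shift for the tr((X+A)^{-1}) example
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 1.0, 2.0])

det, inv = np.linalg.det, np.linalg.inv
X_inv = inv(X)
S_inv = inv(X + A)

# Each entry pairs a scalar function of Z with its claimed derivative at X.
cases = {
    "a^T X X^T b": (lambda Z: a @ Z @ Z.T @ b,
                    (np.outer(a, b) + np.outer(b, a)) @ X),
    "tr(X^T X)": (lambda Z: np.trace(Z.T @ Z), 2 * X),
    "log|X|": (lambda Z: np.log(det(Z)), X_inv.T),
    "|X^{-1}|": (lambda Z: det(inv(Z)), -det(X_inv) * X_inv.T),
    "tr((X+A)^{-1})": (lambda Z: np.trace(inv(Z + A)), -(S_inv @ S_inv).T),
    "|X^3|": (lambda Z: det(Z @ Z @ Z), 3 * det(X) ** 3 * X_inv.T),
}
results = {name: bool(np.allclose(numerical_grad(fn, X), g, rtol=1e-5, atol=1e-6))
           for name, (fn, g) in cases.items()}
print(results)
```

A check like this does not replace the derivations, but it catches transposition and sign mistakes, which are the most common errors when reading a derivative off the form $\operatorname{tr}(G^{\mathrm{T}}dX)$.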