Multinomial Logistic Regression: Convexity and Smoothness
1. Softargmax function
The softargmax (a.k.a. softmax, normalized exponential) function $\sigma:\mathbb{R}^C\to\mathbb{R}^C$ is defined as
$$(\sigma(z))_i=\frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}}\quad\text{for } i=1,\dots,C \text{ and } z\in\mathbb{R}^C.$$
Notice that $\sigma=\sigma(z)$ is a stochastic vector since its elements are nonnegative and sum to 1. For simplicity, we omit the argument $z$ in $\sigma$ when there is no ambiguity.
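To make the definition concrete, here is a minimal NumPy sketch of $\sigma$ (the max-subtraction is a standard numerical-stability trick that does not change the value, since $\sigma$ is invariant to shifting all components of $z$):

```python
import numpy as np

def softargmax(z):
    """Softargmax sigma: R^C -> R^C, computed with the usual max-shift for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softargmax(z)
print(s, s.sum())  # components are positive and sum to 1
```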
- Lemma 1.
- The gradient of $\sigma$ is given by $\nabla\sigma=D(\sigma)-\sigma\sigma^T$, where $D(\sigma)=\operatorname{diag}(\sigma)$. Furthermore, it holds that
$$0\preceq\nabla\sigma\preceq\frac{1}{2}\left(I_C-\frac{1}{C}\mathbf{1}_C\mathbf{1}_C^T\right).\tag{1}$$
Proof.
For any $i,j=1,\dots,C$, we have
$$\frac{\partial(\sigma(z))_i}{\partial z_j}=\frac{\delta_{ij}\,e^{z_i}\sum_{k=1}^C e^{z_k}-e^{z_i}e^{z_j}}{\left(\sum_{k=1}^C e^{z_k}\right)^2}=\delta_{ij}(\sigma(z))_i-(\sigma(z))_i(\sigma(z))_j.$$
In matrix form, we can represent the gradient as
$$\nabla\sigma=\left[\frac{\partial\sigma_i}{\partial z_j}\right]_{ij}=D(\sigma)-\sigma\sigma^T.$$
The lower bound follows from the observation that $M=\nabla\sigma$ is diagonally dominant, i.e., $M_{ii}\ge\sum_{j\ne i}|M_{ij}|$ for every row $i$ (indeed, $M_{ii}=\sigma_i(1-\sigma_i)=\sum_{j\ne i}\sigma_i\sigma_j$). Let $\lambda$ be any eigenvalue of $M$ and $x$ a corresponding eigenvector, so that $Mx=\lambda x$. Pick an index $i\in\arg\max_{1\le j\le C}|x_j|$; then $M_{ii}x_i+\sum_{j\ne i}M_{ij}x_j=\lambda x_i$, which implies
$$|M_{ii}-\lambda|\,|x_i|=\Big|\sum_{j\ne i}M_{ij}x_j\Big|\le\Big(\sum_{j\ne i}|M_{ij}|\Big)|x_i|\le M_{ii}|x_i|.$$
Since $|x_i|>0$, this gives $|M_{ii}-\lambda|\le M_{ii}$ and hence $\lambda\ge 0$, i.e., $M$ is positive semidefinite (PSD). The upper bound in (1) can be found in [1]. ◼
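Lemma 1 is also easy to verify numerically. The sketch below (an illustration, not part of the proof) compares the closed form $D(\sigma)-\sigma\sigma^T$ with a finite-difference Jacobian of $\sigma$ and checks both bounds in (1) through eigenvalues:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
C = 5
z = rng.normal(size=C)
s = softargmax(z)
M = np.diag(s) - np.outer(s, s)                   # claimed gradient D(sigma) - sigma sigma^T

# finite-difference Jacobian of sigma as a cross-check
eps, J = 1e-6, np.zeros((C, C))
for j in range(C):
    dz = np.zeros(C); dz[j] = eps
    J[:, j] = (softargmax(z + dz) - softargmax(z - dz)) / (2 * eps)
print(np.max(np.abs(J - M)))                      # tiny: matches the closed form

B = 0.5 * (np.eye(C) - np.ones((C, C)) / C)
print(np.linalg.eigvalsh(M).min() >= -1e-12)      # lower bound: M is PSD
print(np.linalg.eigvalsh(B - M).min() >= -1e-12)  # upper bound in (1)
```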
2. Multinomial logistic regression (MLR)
To begin with, let us consider the problem with just one observation: an input $x\in\mathbb{R}^d$ and a one-hot output vector $y\in\{0,1\}^C$. The MLR model with parameter $w=[w_1^T,\dots,w_C^T]^T\in\mathbb{R}^{Cd}$ is given by
$$p_i=\Pr(y_i=1\mid x;w)=\frac{e^{w_i^T x}}{\sum_{j=1}^C e^{w_j^T x}}\quad\text{for } i=1,\dots,C.$$
Here, again for simplicity, we ignore the bias term. Notice that the stochastic vector $p=[p_1,\dots,p_C]^T$ is indeed a function of $w$ and $p(w)=\sigma(Wx)$, where $W=[w_1,\dots,w_C]^T$.
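In code, the model probabilities for one observation are just the softargmax of the logits $z=Wx$; here is a small NumPy sketch with arbitrary dimensions $C=3$, $d=4$ chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
C, d = 3, 4
W = rng.normal(size=(C, d))          # rows are w_1^T, ..., w_C^T
x = rng.normal(size=d)

z = W @ x                            # logits z = Wx
p = np.exp(z - z.max())
p /= p.sum()                         # p(w) = sigma(Wx), a stochastic vector
print(p, p.sum())
```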
Maximum likelihood estimation maximizes the log-likelihood
$$\ell(w)=\log\prod_{i=1}^C p_i^{y_i}=\sum_{i=1}^C y_i w_i^T x-\log\left(\sum_{i=1}^C e^{w_i^T x}\right).$$
For convenience, let us consider the minimization of the negative log-likelihood $L(w)=-\ell(w)$. The derivation of the gradient and the Hessian of $L(w)$ involves some simple but interesting algebra. We begin with the partial derivatives:
$$\frac{\partial L(w)}{\partial w_i}=-y_i x+\frac{e^{w_i^T x}}{\sum_{j=1}^C e^{w_j^T x}}\,x=(p_i-y_i)\,x\quad\text{for } i=1,\dots,C.$$
In vector form, the gradient can be written compactly using the Kronecker product:
$$\nabla L(w)=(p-y)\otimes x.\tag{2}$$
Now we compute the Hessian, following a calculation similar to that of the gradient of the softargmax function in Section 1:
$$\frac{\partial^2 L(w)}{\partial w_j\,\partial w_i^T}=\frac{\partial}{\partial w_j}\big((p_i-y_i)\,x\big)^T=\frac{\partial p_i(w)}{\partial w_j}\,x^T=\frac{\partial(\sigma(z))_i}{\partial z_j}\bigg|_{z=Wx}\frac{\partial(w_j^T x)}{\partial w_j}\,x^T=(\delta_{ij}\,p_i-p_i p_j)\,x x^T.$$
From here, we arrive at a lesser-known form of the Hessian:
$$\nabla^2 L(w)=\big(D(p)-pp^T\big)\otimes xx^T.\tag{3}$$
Finally, when there are multiple observations, the formulas (2) and (3) extend to averages over all observations:
$$\nabla L(w)=\frac{1}{N}\sum_{n=1}^N(p_n-y_n)\otimes x_n,\qquad \nabla^2 L(w)=\frac{1}{N}\sum_{n=1}^N\big(D(p_n)-p_n p_n^T\big)\otimes x_n x_n^T.$$
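The averaged formulas (2) and (3) can be checked against finite differences. The following NumPy sketch uses random data with illustrative sizes $N=8$, $C=3$, $d=4$, stores the parameter as $w=[w_1^T,\dots,w_C^T]^T$ to match the Kronecker ordering, and compares the closed-form gradient and Hessian with numerical derivatives of $L(w)$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, C, d = 8, 3, 4
X = rng.normal(size=(N, d))
Y = np.eye(C)[rng.integers(C, size=N)]                 # one-hot labels, shape (N, C)

def probs(w):
    """Row n is p_n = sigma(W x_n), with w = [w_1; ...; w_C] stacked."""
    Z = X @ w.reshape(C, d).T                          # logits, shape (N, C)
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(w):
    """Average negative log-likelihood L(w)."""
    Z = X @ w.reshape(C, d).T
    lse = np.log(np.sum(np.exp(Z - Z.max(1, keepdims=True)), axis=1)) + Z.max(1)
    return np.mean(lse - np.sum(Y * Z, axis=1))

def grad(w):                                           # (1/N) sum_n (p_n - y_n) kron x_n
    P = probs(w)
    return np.mean([np.kron(P[n] - Y[n], X[n]) for n in range(N)], axis=0)

def hess(w):                                           # (1/N) sum_n (D(p_n) - p_n p_n^T) kron x_n x_n^T
    P = probs(w)
    return np.mean([np.kron(np.diag(P[n]) - np.outer(P[n], P[n]),
                            np.outer(X[n], X[n])) for n in range(N)], axis=0)

w, eps = rng.normal(size=C * d), 1e-6
fd_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(C * d)])
fd_hess = np.array([(grad(w + eps * e) - grad(w - eps * e)) / (2 * eps)
                    for e in np.eye(C * d)])
print(np.max(np.abs(fd_grad - grad(w))))               # small: matches (2)
print(np.max(np.abs(fd_hess - hess(w))))               # small: matches (3)
```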
The following theorem asserts that $L(w)$ is a smooth convex function and hence explains why MLR enjoys a nice global geometry on appropriately normalized data.
- Theorem 1.
- Let $X=[x_1,\dots,x_N]^T\in\mathbb{R}^{N\times d}$. Then
$$0\preceq\nabla^2 L(w)\preceq\frac{1}{2}\left(I_C-\frac{1}{C}\mathbf{1}_C\mathbf{1}_C^T\right)\otimes\frac{1}{N}X^T X.\tag{4}$$
Proof.
We use the following property of the Kronecker product: if $A$ and $B$ are square matrices of size $n\times n$ and $m\times m$, respectively, then the eigenvalues of $A\otimes B$ are of the form $\alpha_i\beta_j$ for $i=1,\dots,n$ and $j=1,\dots,m$, where $\alpha_i$ and $\beta_j$ are eigenvalues of $A$ and $B$. By Lemma 1, $D(p_n)-p_np_n^T\succeq 0$; since $x_nx_n^T\succeq 0$ as well, each term $(D(p_n)-p_np_n^T)\otimes x_nx_n^T$ has only nonnegative eigenvalues and is therefore PSD, so summing over $n$ gives $\nabla^2 L(w)\succeq 0$. For the upper bound, Lemma 1 also gives $D(p_n)-p_np_n^T\preceq\frac{1}{2}\big(I_C-\frac{1}{C}\mathbf{1}_C\mathbf{1}_C^T\big)$, and taking the Kronecker product with $x_nx_n^T\succeq 0$ preserves this ordering (if $A\preceq B$ and $C\succeq 0$, then $(B-A)\otimes C\succeq 0$ by the eigenvalue property above):
$$\big(D(p_n)-p_np_n^T\big)\otimes x_nx_n^T\preceq\frac{1}{2}\left(I_C-\frac{1}{C}\mathbf{1}_C\mathbf{1}_C^T\right)\otimes x_nx_n^T.$$
Averaging over $n$ and using $\sum_{n=1}^N x_nx_n^T=X^TX$ yields the upper bound in (4). ◼
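Both Loewner inequalities in (4) can also be verified numerically on random data; in the sketch below, the smallest eigenvalue of $\nabla^2 L(w)$ and of the gap to the bound should be nonnegative up to round-off:

```python
import numpy as np

rng = np.random.default_rng(3)
N, C, d = 20, 4, 3
X = rng.normal(size=(N, d))
W = rng.normal(size=(C, d))

Z = X @ W.T
P = np.exp(Z - Z.max(1, keepdims=True))
P /= P.sum(1, keepdims=True)                           # row n is p_n = sigma(W x_n)

H = np.mean([np.kron(np.diag(P[n]) - np.outer(P[n], P[n]),
                     np.outer(X[n], X[n])) for n in range(N)], axis=0)
U = np.kron(0.5 * (np.eye(C) - np.ones((C, C)) / C), X.T @ X / N)

print(np.linalg.eigvalsh(H).min() >= -1e-10)           # 0 <= Hessian
print(np.linalg.eigvalsh(U - H).min() >= -1e-10)       # Hessian <= bound in (4)
```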
The matrix $R=\frac{1}{N}X^TX$ reminds us of the autocorrelation matrix. Indeed, if the data is standardized to zero mean and unit variance, $R$ gives an estimate of the matrix of Pearson product-moment correlation coefficients.
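For instance (a small sketch on synthetic data): after standardizing each feature, $\frac{1}{N}X^TX$ coincides with NumPy's sample correlation matrix up to round-off.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 3
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated features

Xs = (X - X.mean(axis=0)) / X.std(axis=0)                # zero mean, unit variance per column
R = Xs.T @ Xs / N
print(np.max(np.abs(R - np.corrcoef(X, rowvar=False))))  # agrees up to round-off
```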
Acknowledgement
I would like to thank Magnus Wiese and Neelesh Verma for their feedback on this note.
References
- [1] D. Böhning, “Multinomial logistic regression algorithm,” Annals of the Institute of Statistical Mathematics, vol. 44, no. 1, pp. 197–200, 1992.