Multinomial Logistic Regression: Convexity and Smoothness


1. Softargmax function

The softargmax (a.k.a. softmax, normalized exponential) function $\sigma \colon \mathbb{R}^C \to \mathbb{R}^C$ is defined as

$$(\sigma(z))_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \quad \text{for } i = 1, \dots, C \text{ and } z \in \mathbb{R}^C.$$

Notice that $\sigma = \sigma(z)$ is a stochastic vector since the sum of its elements is 1. For simplicity, we omit the argument $z$ in $\sigma$ when there is no ambiguity.
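The definition translates directly into a few lines of NumPy. The sketch below is only illustrative (the helper name `softargmax` and the test vector are made up); the max-shift uses the shift invariance $\sigma(z) = \sigma(z - c\mathbf{1}_C)$ purely for numerical stability.

```python
import numpy as np

def softargmax(z):
    # sigma(z) = sigma(z - c*1), so subtracting max(z) only prevents overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, -2.0, 0.5])
sigma = softargmax(z)
print(sigma, sigma.sum())  # entries are positive and sum to 1
```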

Lemma 1.
The gradient of $\sigma$ is given by $\nabla\sigma = D(\sigma) - \sigma\sigma^T$, where $D(\sigma) = \operatorname{diag}(\sigma)$. Furthermore, it holds that
$$0 \preceq \nabla\sigma \preceq \frac{1}{2}\Big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\Big). \tag{1}$$

Proof.

For any $i, j = 1, \dots, C$, we have

$$\frac{\partial (\sigma(z))_i}{\partial z_j} = \delta_{ij}\, \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}} - \frac{e^{z_i} e^{z_j}}{\big(\sum_{k=1}^{C} e^{z_k}\big)^2} = \delta_{ij}\,(\sigma(z))_i - (\sigma(z))_i (\sigma(z))_j.$$

In matrix form, we can represent the gradient as

$$\nabla\sigma = \Big[\frac{\partial \sigma_i}{\partial z_j}\Big]_{ij} = D(\sigma) - \sigma\sigma^T.$$

The lower bound follows from the observation that $M = \nabla\sigma$ is diagonally dominant, i.e., $M_{ii} \geq \sum_{j \neq i} |M_{ij}|$ for every row $i$ (indeed, $M_{ii} = \sigma_i(1 - \sigma_i) = \sum_{j \neq i} \sigma_i \sigma_j = \sum_{j \neq i} |M_{ij}|$). Let $\lambda$ be any eigenvalue of $M$ and $x$ a corresponding eigenvector, so that $Mx = \lambda x$. Choosing an index $i$ with $|x_i| = \max_{1 \leq j \leq C} |x_j|$, we have $M_{ii} x_i + \sum_{j \neq i} M_{ij} x_j = \lambda x_i$. This implies

$$|M_{ii} - \lambda|\,|x_i| = \Big|\sum_{j \neq i} M_{ij} x_j\Big| \leq \Big(\sum_{j \neq i} |M_{ij}|\Big)|x_i| \leq M_{ii}\,|x_i|.$$

The last inequality shows that $\lambda \geq 0$, i.e., $M$ is positive semidefinite (PSD). The upper bound in (1) can be found in [1].
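As a quick numerical sanity check of Lemma 1, one can compare the analytic Jacobian with central finite differences and test both matrix bounds on a random input. The sketch below is illustrative only (helper names are made up):

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

C = 5
rng = np.random.default_rng(0)
z = rng.normal(size=C)
sigma = softargmax(z)

# analytic Jacobian from Lemma 1: D(sigma) - sigma sigma^T
analytic = np.diag(sigma) - np.outer(sigma, sigma)

# central finite differences; column j approximates d sigma / d z_j
eps = 1e-6
numeric = np.zeros((C, C))
for j in range(C):
    dz = np.zeros(C)
    dz[j] = eps
    numeric[:, j] = (softargmax(z + dz) - softargmax(z - dz)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))                    # ~1e-10: formulas agree
upper = 0.5 * (np.eye(C) - np.ones((C, C)) / C)
print(np.linalg.eigvalsh(analytic).min() >= -1e-12)          # 0 <= grad(sigma)
print(np.linalg.eigvalsh(upper - analytic).min() >= -1e-12)  # grad(sigma) <= upper bound
```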

2. Multinomial logistic regression (MLR)

To begin with, let us consider the problem with just one observation, consisting of an input $x \in \mathbb{R}^d$ and a one-hot output vector $y \in \{0, 1\}^C$. The MLR model with parameter $w = [w_1^T, \dots, w_C^T]^T \in \mathbb{R}^{Cd}$ is given by

$$p_i = \Pr(y_i = 1 \mid x;\, w) = \frac{e^{w_i^T x}}{\sum_{j=1}^{C} e^{w_j^T x}} \quad \text{for } i = 1, \dots, C.$$

Here again, for simplicity, we ignore the bias term. Notice that the stochastic vector $p = [p_1, \dots, p_C]^T$ is indeed a function of $w$, and $p(w) = \sigma(Wx)$, where $W = [w_1, \dots, w_C]^T$.

Maximum likelihood estimation maximizes the following objective, the log-likelihood:

$$\ell(w) = \log \prod_{i=1}^{C} p_i^{y_i} = \sum_{i=1}^{C} y_i\, w_i^T x - \log\Big(\sum_{i=1}^{C} e^{w_i^T x}\Big).$$

For convenience, let us consider the minimization of the negative log-likelihood $L(w) = -\ell(w)$. The derivation of the gradient and the Hessian of $L(w)$ involves some simple but interesting algebra. First, we begin with the partial derivatives:

$$\frac{\partial L(w)}{\partial w_i} = -y_i\, x + \frac{e^{w_i^T x}}{\sum_{j=1}^{C} e^{w_j^T x}}\, x = (p_i - y_i)\, x \quad \text{for } i = 1, \dots, C.$$

In vector form, the gradient can be written compactly using the Kronecker product:

$$\nabla L(w) = (p - y) \otimes x. \tag{2}$$
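Formula (2) is easy to check numerically. The sketch below is illustrative (the helpers `softargmax` and `nll` are made up); it uses NumPy's row-major reshape so that the flattened vector matches the stacking $w = [w_1^T, \dots, w_C^T]^T$:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nll(w, x, y, C, d):
    # L(w) = log sum_i exp(w_i^T x) - y^T W x, computed with a stable log-sum-exp
    W = w.reshape(C, d)                    # rows are w_1^T, ..., w_C^T
    z = W @ x
    return np.log(np.exp(z - z.max()).sum()) + z.max() - y @ z

C, d = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)
y = np.eye(C)[1]                           # one-hot label
w = rng.normal(size=C * d)

p = softargmax(w.reshape(C, d) @ x)
grad_analytic = np.kron(p - y, x)          # formula (2)

eps = 1e-6
grad_numeric = np.array([
    (nll(w + eps * e_k, x, y, C, d) - nll(w - eps * e_k, x, y, C, d)) / (2 * eps)
    for e_k in np.eye(C * d)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # ~1e-9: formula (2) matches
```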

Now we compute the Hessian, following a calculation similar to that of the gradient of the softargmax function in Section 1:

$$\frac{\partial^2 L(w)}{\partial w_j\, \partial w_i^T} = \frac{\partial}{\partial w_j}\big((p_i - y_i)\, x\big)^T = \frac{\partial p_i(w)}{\partial w_j}\, x^T = \frac{\partial (\sigma(z))_i}{\partial z_j}\, \frac{\partial (w_j^T x)}{\partial w_j}\, x^T = \big(\mathbb{I}_{i = j}\, p_i - p_i p_j\big)\, x x^T.$$

From here, we arrive at a lesser-known form of the Hessian:

$$\nabla^2 L(w) = \big(D(p) - p p^T\big) \otimes x x^T. \tag{3}$$
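Formula (3) can be checked in the same spirit by differencing the analytic gradient (again an illustrative sketch with made-up helper names and the same flattening convention):

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_nll(w, x, y, C, d):
    # formula (2): gradient of the single-observation negative log-likelihood
    p = softargmax(w.reshape(C, d) @ x)
    return np.kron(p - y, x)

C, d = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)
y = np.eye(C)[2]
w = rng.normal(size=C * d)

p = softargmax(w.reshape(C, d) @ x)
H_analytic = np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))   # formula (3)

# numerical Hessian: central differences of the analytic gradient
eps = 1e-6
H_numeric = np.array([
    (grad_nll(w + eps * e_k, x, y, C, d) - grad_nll(w - eps * e_k, x, y, C, d)) / (2 * eps)
    for e_k in np.eye(C * d)
]).T

print(np.max(np.abs(H_analytic - H_numeric)))           # ~1e-10: formula (3) matches
print(np.linalg.eigvalsh(H_analytic).min() >= -1e-12)   # the Hessian is PSD
```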

Finally, when there are multiple observations, the formulas (2) and (3) extend to averages over all observations:

$$\nabla L(w) = \frac{1}{N}\sum_{n=1}^{N} (p_n - y_n) \otimes x_n, \qquad \nabla^2 L(w) = \frac{1}{N}\sum_{n=1}^{N} \big(D(p_n) - p_n p_n^T\big) \otimes x_n x_n^T.$$
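The averaged gradient admits the same kind of finite-difference check; the sketch below (illustrative names `avg_nll`, `X`, `Y`) builds it from the per-sample Kronecker terms:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def avg_nll(w, X, Y, C, d):
    # (1/N) sum_n [ log sum_i exp(w_i^T x_n) - y_n^T W x_n ]
    W = w.reshape(C, d)
    total = 0.0
    for x, y in zip(X, Y):
        z = W @ x
        total += np.log(np.exp(z - z.max()).sum()) + z.max() - y @ z
    return total / len(X)

N, C, d = 8, 3, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                     # rows x_n^T
Y = np.eye(C)[rng.integers(0, C, size=N)]       # one-hot labels y_n
w = rng.normal(size=C * d)

P = np.array([softargmax(w.reshape(C, d) @ x) for x in X])
grad_analytic = sum(np.kron(P[n] - Y[n], X[n]) for n in range(N)) / N

eps = 1e-6
grad_numeric = np.array([
    (avg_nll(w + eps * e_k, X, Y, C, d) - avg_nll(w - eps * e_k, X, Y, C, d)) / (2 * eps)
    for e_k in np.eye(C * d)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # ~1e-9: averaged formula matches
```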

The following theorem asserts that $L(w)$ is a smooth convex function and hence explains why MLR enjoys a nice global geometry on appropriately normalized data.

Theorem 1.
Let $X = [x_1, \dots, x_N]^T \in \mathbb{R}^{N \times d}$. Then
$$0 \preceq \nabla^2 L(w) \preceq \frac{1}{2}\Big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\Big) \otimes \frac{1}{N} X^T X. \tag{4}$$

Proof.

We use the following property of the Kronecker product: if $A$ and $B$ are square matrices of size $n \times n$ and $m \times m$, respectively, then the eigenvalues of $A \otimes B$ are of the form $\alpha_i \beta_j$ for $i = 1, \dots, n$ and $j = 1, \dots, m$, where $\alpha_i$ and $\beta_j$ are the eigenvalues of $A$ and $B$. Thus, by Lemma 1,

$$\lambda_{\min}\big(\nabla^2 L(w)\big) \geq \frac{1}{N}\sum_{n=1}^{N} \min \lambda\big(D(p_n) - p_n p_n^T\big)\,\lambda\big(x_n x_n^T\big) \geq 0,$$

since $\lambda_{\min}$ is superadditive and the eigenvalues of each Kronecker term are products of nonnegative numbers. For the upper bound, Lemma 1 gives $D(p_n) - p_n p_n^T \preceq \frac{1}{2}\big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\big)$; combining this with $x_n x_n^T \succeq 0$ and the eigenvalue property above,

$$\big(D(p_n) - p_n p_n^T\big) \otimes x_n x_n^T \preceq \frac{1}{2}\Big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\Big) \otimes x_n x_n^T \quad \text{for } n = 1, \dots, N.$$

Averaging over $n$ and using the bilinearity of the Kronecker product gives

$$\nabla^2 L(w) \preceq \frac{1}{2}\Big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\Big) \otimes \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T = \frac{1}{2}\Big(I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^T\Big) \otimes \frac{1}{N} X^T X.$$

The lower and upper bounds together imply (4).

The matrix $R = \frac{1}{N} X^T X$ reminds us of the autocorrelation matrix. Indeed, if the data is standardized to zero mean and unit variance, $R$ gives an estimate of the matrix of Pearson product-moment correlation coefficients.
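To close, here is a small sketch (random data, illustrative names) that verifies Theorem 1 numerically: the averaged Hessian is PSD and is dominated, in the Loewner order, by $\frac{1}{2}\big(I_C - \frac{1}{C}\mathbf{1}_C\mathbf{1}_C^T\big) \otimes R$:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

N, C, d = 50, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                 # rows x_n^T
W = rng.normal(size=(C, d))                 # rows w_1^T, ..., w_C^T

# averaged Hessian, built term by term from formula (3)
H = np.zeros((C * d, C * d))
for x in X:
    p = softargmax(W @ x)
    H += np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))
H /= N

R = X.T @ X / N
bound = np.kron(0.5 * (np.eye(C) - np.ones((C, C)) / C), R)

print(np.linalg.eigvalsh(H).min() >= -1e-10)           # 0 <= Hessian (convexity)
print(np.linalg.eigvalsh(bound - H).min() >= -1e-10)   # Hessian <= bound (smoothness)
```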

Acknowledgement

I would like to thank Magnus Wiese and Neelesh Verma for their feedback on this note.

References

  1. D. Böhning, “Multinomial logistic regression algorithm,” Annals of the Institute of Statistical Mathematics, vol. 44, no. 1, pp. 197–200, 1992.