Multinomial Logistic Regression: Convexity and Smoothness

1. Softargmax function

The softargmax (a.k.a. softmax, normalized exponential) function $\DeclareMathOperator*{\argmax}{argmax} \newcommand{\bm}[1]{\boldsymbol#1} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \bm \sigma: \mathbb{R}^C \rightarrow \mathbb{R}^C$ is defined as

\[\bigl( \bm \sigma(\bm z) \bigr)_i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} \text{ for } i=1,\ldots,C \text{ and } \bm z \in \mathbb{R}^C .\]

Notice that $\bm \sigma = \bm \sigma(\bm z)$ is a stochastic vector: its entries are positive and sum to $1$. For simplicity, we omit the argument $\bm z$ in $\bm \sigma$ when there is no ambiguity.
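
As a quick sanity check, here is a minimal NumPy sketch of the softargmax (the function name `softargmax` and the example vector are only for illustration). Subtracting $\max_j z_j$ before exponentiating leaves the output unchanged and avoids overflow:

```python
import numpy as np

def softargmax(z):
    """Softargmax of a vector z in R^C, numerically stabilized."""
    # Subtracting max(z) does not change the result but prevents overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
sigma = softargmax(z)
print(sigma, sigma.sum())  # positive entries that sum to 1
```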

Lemma 1.
The gradient of $\bm \sigma$ is given by $\nabla \bm \sigma = \bm D \bigl(\bm \sigma \bigr) - \bm \sigma \bm \sigma^T$, where $\bm D(\bm \sigma) = \text{diag}(\bm \sigma)$. Furthermore, it holds that
\[0 \preceq \nabla \bm \sigma \preceq \frac{1}{2} \Bigl( \bm I_C - \frac{1}{C} \bm 1_C \bm 1_C^T \Bigr) . \tag{1} \label{bound_sigma}\]

Proof.

For any $i,j = 1,\ldots,C$, we have

\[\frac{\partial \bigl( \bm \sigma(\bm z) \bigr)_i}{\partial z_j} = \frac{\delta_{ij} e^{z_i} \sum_{k=1}^C e^{z_k} - e^{z_i} e^{z_j}}{\Bigl(\sum_{k=1}^C e^{z_k}\Bigr)^2} = \delta_{ij} \bigl( \bm \sigma(\bm z) \bigr)_i - \bigl( \bm \sigma(\bm z) \bigr)_i \bigl( \bm \sigma(\bm z) \bigr)_j .\]

In matrix form, we can represent the gradient as

\[\nabla \bm \sigma = \Bigl[ \frac{\partial \sigma_i}{\partial z_j} \Bigr]_{ij} = \bm D \bigl(\bm \sigma \bigr) - \bm \sigma \bm \sigma^T .\]

The lower bound follows from the observation that $\bm M = \nabla \bm \sigma$ is diagonally dominant with nonnegative diagonal, i.e., $M_{ii} \geq \sum_{j \neq i} \abs{M_{ij}}$ for every row $i$. Let $\lambda$ be any eigenvalue of $\bm M$ and $\bm x$ a corresponding eigenvector, so that $\bm M \bm x = \lambda \bm x$. Let $i \in \argmax_{1\leq j \leq C} \abs{x_j}$; then $M_{ii}x_i + \sum_{j \neq i} M_{ij} x_j = \lambda x_i$. This implies

\[\abs{M_{ii}-\lambda} \abs{x_i} = \abs{\sum_{j \neq i} M_{ij} x_j} \leq \Bigl(\sum_{j \neq i} \abs{M_{ij}}\Bigr) \abs{x_i} \leq M_{ii} \abs{x_i} .\]

Since $\abs{x_i} > 0$, dividing by it gives $\abs{M_{ii}-\lambda} \leq M_{ii}$, and hence $\lambda \geq 0$; that is, $\bm M$ is positive semidefinite (PSD). The upper bound in (\ref{bound_sigma}) can be found in [1]. $\qquad \blacksquare$
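
To make Lemma 1 concrete, the following sketch (the random test point and tolerances are arbitrary choices for the check) compares the formula for $\nabla \bm \sigma$ with a finite-difference Jacobian and verifies both sides of the bound (\ref{bound_sigma}) numerically:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
C = 5
z = rng.normal(size=C)

sigma = softargmax(z)
J = np.diag(sigma) - np.outer(sigma, sigma)   # D(sigma) - sigma sigma^T

# Finite-difference check of the Jacobian formula: column j is d sigma / d z_j.
eps = 1e-6
J_fd = np.column_stack([
    (softargmax(z + eps * e_j) - softargmax(z - eps * e_j)) / (2 * eps)
    for e_j in np.eye(C)
])
print(np.allclose(J, J_fd, atol=1e-6))                 # True

# Check the bound (1): both J and (upper bound - J) should be PSD.
upper = 0.5 * (np.eye(C) - np.ones((C, C)) / C)
print(np.linalg.eigvalsh(J).min() >= -1e-12)           # J is PSD
print(np.linalg.eigvalsh(upper - J).min() >= -1e-12)   # J is below the upper bound
```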

2. Multinomial logistic regression (MLR)

To begin with, let us consider the problem with just one observation, consisting of an input $\bm x \in \mathbb{R}^d$ and a one-hot output vector $\bm y \in \{ 0,1 \}^C$. The MLR model with parameter $\bm w = [\bm w_1^T, \ldots, \bm w_C^T]^T \in \mathbb{R}^{Cd}$ is given by

\[p_i = Pr(y_i=1 \mid \bm x ; \bm w) = \frac{e^{\bm w_i^T \bm x}}{\sum_{j=1}^C e^{\bm w_j^T \bm x}} \quad \text{ for } i=1,\ldots,C .\]

Here again, for simplicity, we omit the bias term. Notice that the stochastic vector $\bm p = [p_1,\ldots,p_C]^T$ is indeed a function of $\bm w$, with $\bm p(\bm w) = \bm \sigma(\bm W \bm x)$, where $\bm W = [\bm w_1,\ldots,\bm w_C]^T$.

Maximum likelihood estimation aims to maximize the log-likelihood

\[l(\bm w) = \log \prod_{i=1}^C p_i^{y_i} = \sum_{i=1}^C y_i \bm w_i^T \bm x - \log \Bigl( \sum_{i=1}^C e^{\bm w_i^T \bm x} \Bigr) ,\]

where the second equality uses the fact that $\sum_{i=1}^C y_i = 1$.
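
As a small sketch of this objective (the matrix `W`, whose rows are $\bm w_1,\ldots,\bm w_C$, and the label index are made up for illustration), note that for a one-hot $\bm y$ the log-likelihood is just the log-probability of the true class:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
C, d = 4, 3
W = rng.normal(size=(C, d))   # rows are w_1, ..., w_C
x = rng.normal(size=d)
y = np.eye(C)[2]              # one-hot label with true class index 2

p = softargmax(W @ x)                                 # p(w) = sigma(W x)
log_lik = y @ (W @ x) - np.log(np.exp(W @ x).sum())   # l(w); use a log-sum-exp trick in practice

print(np.isclose(log_lik, np.log(p[2])))  # True: l(w) = log p_k for the true class k
```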

For convenience, let us consider the minimization of the negative log-likelihood $L(\bm w) = -l(\bm w)$. The derivation of the gradient and the Hessian of $L(\bm w)$ involves some simple but interesting algebra. First, we begin with the derivative:

\[\frac{\partial L(\bm w)}{\partial \bm w_i} = -y_i \bm x + \frac{e^{\bm w_i^T \bm x}}{\sum_{j=1}^C e^{\bm w_j^T \bm x}} \bm x = (p_i - y_i) \bm x \quad \text{ for } i=1,\ldots,C .\]

In vector form, the gradient can be written compactly using the Kronecker product:

\[\nabla L(\bm w) = (\bm p - \bm y) \otimes \bm x . \tag{2} \label{gradient}\]

Now we compute the Hessian, following a calculation similar to that of the gradient of the softargmax function in Section 1:

\[\frac{\partial^2 L(\bm w)}{\partial \bm w_j \partial \bm w_i^T} = \frac{\partial}{\partial \bm w_j} \bigl( (p_i - y_i) \bm x \bigr)^T = \frac{\partial p_i(\bm w)}{\partial \bm w_j} \bm x^T = \frac{\partial \bigl( \bm \sigma(\bm z) \bigr)_i}{\partial z_j} \frac{\partial (\bm w_j^T \bm x)}{\partial \bm w_j} \bm x^T = (\delta_{ij} p_i - p_i p_j) \bm x \bm x^T ,\]

where $\bm z = \bm W \bm x$ and we applied the chain rule together with Lemma 1.

From here, we come up with a lesser-known form of the Hessian

\[\nabla^2 L(\bm w) = \bigl(\bm D(\bm p) - \bm p \bm p^T \bigr) \otimes \bm x \bm x^T . \tag{3} \label{hessian}\]
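
Formulas (\ref{gradient}) and (\ref{hessian}) are easy to check numerically. Here is a sketch for a single observation (the helper `nll` and the random test point are assumptions made for the check, not part of the derivation), comparing the Kronecker-product forms against finite differences:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
C, d = 4, 3
x = rng.normal(size=d)
y = np.eye(C)[1]              # one-hot label
w = rng.normal(size=C * d)    # stacked parameter [w_1; ...; w_C]
E = np.eye(C * d)

def nll(w):
    """Negative log-likelihood L(w) for the single observation (x, y)."""
    z = w.reshape(C, d) @ x
    return -(y @ z - np.log(np.exp(z).sum()))

p = softargmax(w.reshape(C, d) @ x)
grad = np.kron(p - y, x)                                      # formula (2)
hess = np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))   # formula (3)

eps = 1e-5
grad_fd = np.array([(nll(w + eps * e) - nll(w - eps * e)) / (2 * eps) for e in E])
hess_fd = np.array([[(nll(w + eps * (E[i] + E[j])) - nll(w + eps * E[i])
                      - nll(w + eps * E[j]) + nll(w)) / eps**2
                     for j in range(C * d)] for i in range(C * d)])

print(np.allclose(grad, grad_fd, atol=1e-6))  # True
print(np.allclose(hess, hess_fd, atol=1e-3))  # True, up to finite-difference error
```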

Finally, when there are multiple observations $(\bm x_n, \bm y_n)$, $n=1,\ldots,N$, the formulas (\ref{gradient}) and (\ref{hessian}) extend to averages over all observations:

\begin{align} \nabla L(\bm w) &= \frac{1}{N} \sum_{n=1}^N (\bm p_n - \bm y_n) \otimes \bm x_n , \\
\nabla^2 L(\bm w) &= \frac{1}{N} \sum_{n=1}^N \bigl(\bm D(\bm p_n) - \bm p_n \bm p_n^T \bigr) \otimes \bm x_n \bm x_n^T . \end{align}
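
The averaged gradient can be checked in the same way; in this sketch the data `X`, the one-hot labels `Y`, and the helper `nll` are made up for the test:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(4)
N, C, d = 20, 4, 3
X = rng.normal(size=(N, d))
Y = np.eye(C)[rng.integers(C, size=N)]   # one-hot labels
w = rng.normal(size=C * d)

def nll(w):
    """Empirical negative log-likelihood averaged over the N observations."""
    W = w.reshape(C, d)
    return -np.mean([y @ (W @ x) - np.log(np.exp(W @ x).sum()) for x, y in zip(X, Y)])

P = np.array([softargmax(w.reshape(C, d) @ x) for x in X])
grad = np.mean([np.kron(p - y, x) for p, y, x in zip(P, Y, X)], axis=0)

eps = 1e-6
grad_fd = np.array([(nll(w + eps * e) - nll(w - eps * e)) / (2 * eps)
                    for e in np.eye(C * d)])
print(np.allclose(grad, grad_fd, atol=1e-6))  # True
```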

The following theorem asserts that $L(\bm w)$ is a smooth convex function, which explains why MLR enjoys a nice global geometry on appropriately normalized data.

Theorem 1.
Let $\bm X = [\bm x_1,\ldots,\bm x_N]^T \in \mathbb{R}^{N \times d}$. Then \begin{align} 0 \preceq \nabla^2 L(\bm w) \preceq \frac{1}{2} \Bigl( \bm I_C - \frac{1}{C} \bm 1_C \bm 1_C^T \Bigr) \otimes \frac{1}{N} \bm X^T \bm X . \tag{4} \label{bound_hessian} \end{align}

Proof.

We use the following property of the Kronecker product: if $\bm A$ and $\bm B$ are square matrices of size $n \times n$ and $m \times m$, respectively, then the eigenvalues of $\bm A \otimes \bm B$ are of the form $\alpha_i \beta_j$ for $i=1,\ldots,n$ and $j=1,\ldots,m$, where $\alpha_i$ and $\beta_j$ are the eigenvalues of $\bm A$ and $\bm B$, respectively. Thus by Lemma 1,

each summand of the Hessian, $\bigl(\bm D(\bm p_n) - \bm p_n \bm p_n^T \bigr) \otimes \bm x_n \bm x_n^T$, has only nonnegative eigenvalues: every eigenvalue is a product of an eigenvalue of the PSD matrix $\bm D(\bm p_n) - \bm p_n \bm p_n^T$ and an eigenvalue of the PSD matrix $\bm x_n \bm x_n^T$. Hence $\nabla^2 L(\bm w) \succeq 0$ as an average of PSD matrices.

For the upper bound, Lemma 1 gives $\bm D(\bm p_n) - \bm p_n \bm p_n^T \preceq \frac{1}{2} \bigl( \bm I_C - \frac{1}{C} \bm 1_C \bm 1_C^T \bigr)$ for every $n$. Applying the same eigenvalue property to the difference,

\[\Bigl( \frac{1}{2} \Bigl( \bm I_C - \frac{1}{C} \bm 1_C \bm 1_C^T \Bigr) - \bigl(\bm D(\bm p_n) - \bm p_n \bm p_n^T \bigr) \Bigr) \otimes \bm x_n \bm x_n^T \succeq 0 ,\]

that is, $\bigl(\bm D(\bm p_n) - \bm p_n \bm p_n^T \bigr) \otimes \bm x_n \bm x_n^T \preceq \frac{1}{2} \Bigl( \bm I_C - \frac{1}{C} \bm 1_C \bm 1_C^T \Bigr) \otimes \bm x_n \bm x_n^T$. Averaging over $n$ and using $\frac{1}{N} \sum_{n=1}^N \bm x_n \bm x_n^T = \frac{1}{N} \bm X^T \bm X$ yields the upper bound in (\ref{bound_hessian}). $\qquad \blacksquare$

The matrix $\bm R = \frac{1}{N} \bm X^T \bm X$ reminds us of the autocorrelation matrix. Indeed, if the data is standardized to zero mean and unit variance, $\bm R$ gives an estimate of the matrix of Pearson product-moment correlation coefficients.
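
Theorem 1 can also be verified numerically. The sketch below (random data of arbitrary size) builds the Hessian from the averaged Kronecker form and checks both sides of (\ref{bound_hessian}) as PSD conditions:

```python
import numpy as np

def softargmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(3)
N, C, d = 50, 4, 3
X = rng.normal(size=(N, d))
W = rng.normal(size=(C, d))   # an arbitrary parameter value

# Hessian as the average of Kronecker products over the observations.
H = np.zeros((C * d, C * d))
for x in X:
    p = softargmax(W @ x)
    H += np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))
H /= N

# Right-hand side of (4).
B = np.kron(0.5 * (np.eye(C) - np.ones((C, C)) / C), X.T @ X / N)

print(np.linalg.eigvalsh(H).min() >= -1e-10)      # H is PSD
print(np.linalg.eigvalsh(B - H).min() >= -1e-10)  # B - H is PSD
```

In particular, since $\lambda_{\max}\bigl(\frac{1}{2}(\bm I_C - \frac{1}{C}\bm 1_C \bm 1_C^T)\bigr) = \frac{1}{2}$, the bound (\ref{bound_hessian}) implies that $\nabla L$ is Lipschitz continuous with constant at most $\frac{1}{2}\lambda_{\max}(\bm R)$, which quantifies the smoothness of $L$.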

Acknowledgement

I would like to thank Magnus Wiese and Neelesh Verma for their feedback on this note.

References

  1. D. Böhning, “Multinomial logistic regression algorithm,” Annals of the Institute of Statistical Mathematics, vol. 44, no. 1, pp. 197–200, 1992.