Convergence of Heavy-Ball Method and Nesterov’s Accelerated Gradient on Quadratic Optimization
We consider the following quadratic optimization problem:
\[\DeclareMathOperator*{\argmin}{argmin} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\bm}[1]{\boldsymbol#1} \min_{\bm x \in \mathbb{R}^n} f(\bm x) = \frac{1}{2} \bm x^T \bm A \bm x - \bm b^T \bm x + c , \tag{1} \label{equ:Abc}\]where $\bm A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix with eigenvalues $L = \lambda_1 \geq \ldots \geq \lambda_n = \mu > 0$. The quadratic model is simple, yet powerful enough for studying the local convergence of optimization algorithms. As we will see in the sequel, this nice model allows us to extract closed-form expressions for the convergence rates through elementary algebra. Before moving to the algorithms, let us further simplify (\ref{equ:Abc}) by the change of variables $\tilde{\bm x} = \bm U^T (\bm x - \bm x^\star)$, where $\bm x^\star$ is the unique minimizer of the problem and $\bm U$ is the orthogonal matrix in the eigenvalue decomposition of $\bm A$, i.e., $\bm A = \bm U \text{diag}(\lambda_1,\ldots,\lambda_n) \bm U^T$. Thus we arrive at the alternative minimization
\[\min_{\bm x \in \mathbb{R}^n} f(\bm x) = \frac{1}{2} \bm x^T \bm \Lambda \bm x \tag{2} \label{equ:standard} ,\]with $\bm \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_n)$. The gradient and Hessian of the objective function are given by
\[\nabla f(\bm x) = \bm \Lambda \bm x , \qquad \nabla^2 f(\bm x) = \bm \Lambda .\]Furthermore, we denote by $\kappa=\frac{L}{\mu} \geq 1$ the condition number of the objective function. Intuitively, the higher $\kappa$ is, the harder it is to optimize $f$. When $\kappa$ approaches $\infty$, problem (\ref{equ:standard}) is said to be ill-conditioned. Our goal in this note is to explain Proposition 1 in [3], summarized in the following table:
Method | Parameter choice | Convergence rate | Iteration complexity |
---|---|---|---|
Gradient descent (GD) | $\alpha=\frac{2}{L+\mu}$ | $\rho=\frac{\kappa-1}{\kappa+1}$ | $O(\kappa \log \frac{1}{\epsilon})$ |
Heavy-Ball (HB) | $\alpha=\biggl(\frac{2}{\sqrt{L}+\sqrt{\mu}}\biggr)^2, \beta=\biggl( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \biggr)^2$ | $\rho=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ | $O(\sqrt{\kappa} \log \frac{1}{\epsilon})$ |
Nesterov's Accelerated Gradient (NAG) | $\alpha=\frac{4}{3L+\mu}, \beta=\frac{\sqrt{3\kappa+1}-2}{\sqrt{3\kappa+1}+2}$ | $\rho=1-\frac{2}{\sqrt{3\kappa+1}}$ | $O(\sqrt{\kappa} \log \frac{1}{\epsilon})$ |
Note that here we restrict our attention to the case of constant step sizes. Without further ado, let us get to the algorithms, starting with a quick numerical sketch of the reduction above.
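To make the reduction from (\ref{equ:Abc}) to (\ref{equ:standard}) concrete, here is a minimal NumPy sketch; the problem size and eigenvalue range ($n$, $\mu$, $L$) are arbitrary test values, not anything prescribed by the analysis.

```python
import numpy as np

# Minimal test instance of problem (1); n, mu, L are arbitrary choices.
n, mu, L = 50, 1.0, 100.0
rng = np.random.default_rng(0)

# A = U diag(lambda_1, ..., lambda_n) U^T with eigenvalues in [mu, L].
lam = np.concatenate(([L, mu], rng.uniform(mu, L, n - 2)))
U, _ = np.linalg.qr(rng.standard_normal((n, n)))      # random orthogonal basis
A = U @ np.diag(lam) @ U.T
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)                        # unique minimizer of (1)

# Change of variables: x_tilde = U^T (x - x_star) turns (1) into (2),
# whose Hessian is the diagonal matrix Lambda = diag(lam).
x = rng.standard_normal(n)
x_tilde = U.T @ (x - x_star)

f = 0.5 * x @ A @ x - b @ x                           # objective (1) with c = 0
f_tilde = 0.5 * x_tilde @ np.diag(lam) @ x_tilde      # objective (2)
print(np.isclose(f - (0.5 * x_star @ A @ x_star - b @ x_star), f_tilde))  # True
```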
1. Gradient Descent
To warm up, we consider the gradient descent update:
\[\bm x^{(k+1)} = \bm x^{(k)} - \alpha \nabla f(\bm x^{(k)}) \qquad \text{ for some constant } \alpha>0 .\]Substituting the gradient of $f$ and taking the Euclidean norm, we have
\begin{align} \norm{\bm x^{(k+1)}} &= \norm{\bm x^{(k)} - \alpha \bm \Lambda \bm x^{(k)}} = \norm{(\bm I_n - \alpha \bm \Lambda) \bm x^{(k)}} \\
&\leq \norm{\bm I_n - \alpha \bm \Lambda}_2 \norm{\bm x^{(k)}} = \max_j \abs{1-\alpha \lambda_j} \norm{\bm x^{(k)}} \\
&= \max \Bigl\{ \abs{1-\alpha \mu}, \abs{1-\alpha L} \Bigr\} \norm{\bm x^{(k)}} . \end{align}
where the last equality stems from the fact that $\abs{1-\alpha \lambda}$ is a continuous and convex function of $\lambda$, so its maximum over $[\mu, L]$ is attained at an endpoint. Now, letting $\rho_\alpha = \max \Bigl\{ \abs{1-\alpha \mu}, \abs{1-\alpha L} \Bigr\}$, we can unroll the recursion as
\[\norm{\bm x^{(k)}} \leq \rho_\alpha^k \norm{\bm x^{(0)}} . \tag{3} \label{equ:GD}\]The inequality (\ref{equ:GD}) shows that GD converges linearly (or geometrically) to the optimal solution $\bm x^\star = \bm 0$ at rate $\rho_\alpha$. In order to converge, we need $\rho_\alpha < 1$, which is equivalent to \begin{align} 0 < \alpha < \frac{2}{L} . \end{align} The optimal step size is chosen by solving \begin{align} \alpha_{opt} = \argmin_{0 < \alpha < \frac{2}{L}} \max \Bigl\{ \abs{1-\alpha \mu}, \abs{1-\alpha L} \Bigr\} . \end{align} Although one can guess that the optimum occurs where the two quantities are equal, let us verify this carefully. First, notice that \begin{align} \abs{1-\alpha \mu} < \abs{1-\alpha L} \Leftrightarrow \alpha > \frac{2}{L+\mu} . \end{align} Thus, we consider two cases:
If $0 < \alpha \leq \frac{2}{L+\mu}$, then \begin{align} \max \Bigl\{ \abs{1-\alpha \mu}, \abs{1-\alpha L} \Bigr\} = 1-\alpha \mu \geq 1-\frac{2\mu}{L+\mu} = \frac{L-\mu}{L+\mu} . \end{align} The optimal step size in this case is $\frac{2}{L+\mu}$. Compared with the popular but suboptimal choice $\alpha = \frac{1}{L}$ often seen in the literature, this step size is nearly twice as large when $\kappa$ is large (the ratio is $\frac{2L}{L+\mu} \in (1,2)$).
Otherwise, if $\frac{2}{L+\mu} < \alpha < \frac{2}{L}$, then \begin{align} \max \Bigl\{ \abs{1-\alpha \mu}, \abs{1-\alpha L} \Bigr\} = \alpha L - 1 > \frac{2L}{L+\mu} - 1 = \frac{L-\mu}{L+\mu} . \end{align}
In summary, we have $\alpha_{opt}=\frac{2}{L+\mu}$ and $\rho_{opt}=\frac{\kappa-1}{\kappa+1}$. At this optimal step size, the number of iterations to reach (relative) $\epsilon$-accuracy, i.e., $\frac{\norm{\bm x^{(k)}}}{\norm{\bm x^{(0)}}} < \epsilon \ll 1$, satisfies \begin{align} \rho_{opt}^k < \epsilon \Leftrightarrow k > \frac{\log (1/\epsilon)}{-\log \bigl(1 - \frac{2}{\kappa+1} \bigr)} \approx \frac{\kappa+1}{2} \log \frac{1}{\epsilon} . \end{align} In other words, the iteration complexity of GD with optimal step size is $O(\kappa \log \frac{1}{\epsilon})$.
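As a quick sanity check, the following sketch (assuming NumPy; the eigenvalues are arbitrary test values in $[\mu, L]$) runs GD with $\alpha = \frac{2}{L+\mu}$ on the diagonal model (\ref{equ:standard}) and compares the observed per-step contraction with the predicted rate $\frac{\kappa-1}{\kappa+1}$.

```python
import numpy as np

# Gradient descent on the diagonal model (2); mu, L, n are arbitrary test values.
mu, L, n = 1.0, 100.0, 50
rng = np.random.default_rng(0)
lam = np.concatenate(([L, mu], rng.uniform(mu, L, n - 2)))   # eigenvalues of Lambda

alpha = 2.0 / (L + mu)                     # optimal constant step size
rho = (L / mu - 1) / (L / mu + 1)          # predicted rate (kappa - 1) / (kappa + 1)

x = rng.standard_normal(n)
norms = [np.linalg.norm(x)]
for _ in range(200):
    x = x - alpha * lam * x                # gradient step: grad f(x) = Lambda x
    norms.append(np.linalg.norm(x))

empirical = (norms[-1] / norms[100]) ** (1 / 100)   # average contraction per step
print(f"predicted rate {rho:.4f}, empirical rate {empirical:.4f}")
```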
2. Heavy-Ball Method
The Heavy-Ball method adds a momentum term, built from the previous iterate, to the gradient descent update:
\[\bm x^{(k+1)} = \bm x^{(k)} - \alpha \nabla f(\bm x^{(k)}) + \beta (\bm x^{(k)} - \bm x^{(k-1)}) \qquad \text{ for some constants } \alpha>0, \beta\geq 0 .\]On a standard quadratic, the update can be further simplified as
\[\bm x^{(k+1)} = \bm x^{(k)} - \alpha \bm \Lambda \bm x^{(k)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) = \bigl( (1+\beta) \bm I_n - \alpha \bm \Lambda \bigr) \bm x^{(k)} - \beta \bm x^{(k-1)} .\]Stacking the current iterate and the previous one yields \begin{align} \begin{bmatrix} \bm x^{(k+1)} \\\ \bm x^{(k)} \end{bmatrix} = \begin{bmatrix} (1+\beta)\bm I_n - \alpha \bm \Lambda & -\beta \bm I_n \\\ \bm I_n & \bm 0 \end{bmatrix} \begin{bmatrix} \bm x^{(k)} \\\ \bm x^{(k-1)} \end{bmatrix} . \end{align} Denoting $\bm y^{(k)} = \begin{bmatrix} \bm x^{(k+1)} \\\ \bm x^{(k)} \end{bmatrix}$ and $\bm T = \begin{bmatrix} (1+\beta)\bm I_n - \alpha \bm \Lambda & -\beta \bm I_n \\\ \bm I_n & \bm 0 \end{bmatrix}$, we derive the exponential decrease in the norm of $\bm y$:
\[\norm{\bm y^{(k)}} = \norm{\bm T \bm y^{(k-1)}} = \norm{\bm T^k \bm y^{(0)}} \leq \norm{\bm T^k}_2 \norm{\bm y^{(0)}} \leq \bigl(\rho(\bm T) + o(1) \bigr)^k \norm{\bm y^{(0)}} , \tag{4} \label{equ:HB}\]where $\rho(\bm T)$ is the spectral radius of $\bm T$ and the last inequality uses Gelfand’s formula. Thus, in order to determine the convergence rate, we will need to find the eigenvalues of $\bm T$ and determine their maximum absolute value. By carefully looking at its special structure, one can show that $\bm T$ is permutation-similar to a block diagonal matrix: \begin{align} \bm T \sim \begin{bmatrix} \bm T_1 & \bm 0 & \ldots & \bm 0 \\\ \bm 0 & \bm T_2 & \ldots & \bm 0 \\\ \vdots & & \ddots & \vdots \\\ \bm 0 & \bm 0 & \ldots & \bm T_n \end{bmatrix} \quad \text{where} \quad \bm T_j = \begin{bmatrix} 1+\beta-\alpha \lambda_j & -\beta \\\ 1 & 0 \end{bmatrix} \text{ for } j=1,2,\ldots,n . \end{align} Hence, the eigenvalues of $\bm T$ are the union of the eigenvalues of $\bm T_j$. For each $j$, the eigenvalues of $\bm T_j$ are the two roots of the equation
\[r^2 - (1+\beta - \alpha \lambda_j) r + \beta = 0 \quad \text{with} \quad \Delta_j = (1+\beta - \alpha \lambda_j)^2 - 4\beta .\]Therefore, we can bound the convergence rate by \begin{align} \rho_{\alpha,\beta} \approx \rho(\bm T) = \max_j r_j = \max \{ r_1, r_n \}, \quad \text{where } r_j = \begin{cases} \frac{1}{2} \Bigl( \abs{1+\beta-\alpha \lambda_j} + \sqrt{\Delta_j} \Bigr) & \text{if } \Delta_j > 0, \\\ \sqrt{\beta} & \text{otherwise.} \end{cases} \end{align} Here we can think of $r$ as a continuous and quasiconvex function of $\lambda$, so its maximum over the eigenvalues is attained at one of the extremes $\mu$ or $L$. The choice of step sizes must satisfy $\rho_{\alpha,\beta} < 1$, or equivalently, \begin{align} \begin{cases} 0 \leq \beta < 1, \\\ 0 < \alpha < \frac{2(1+\beta)}{L} . \tag{5} \label{equ:HB_range} \end{cases} \end{align} The optimal step sizes are chosen by minimizing $\rho_{\alpha,\beta}$ over this range, which is more involved than in the GD case. By observing that \begin{align} \Delta_j \leq 0 \Leftrightarrow \beta \geq \bigl(1-\sqrt{\alpha \lambda_j}\bigr)^2 \quad \text{and} \quad \abs{1-\sqrt{\alpha \mu}} < \abs{1-\sqrt{\alpha L}} \Leftrightarrow \alpha > \biggl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \biggr)^2 , \end{align} we can break it into $4$ cases:
If $0 < \alpha \leq \Bigl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \Bigr)^2$ and $\beta \geq (1-\sqrt{\alpha \mu})^2$, then \begin{align} \max \{ r_1, r_n \} = \sqrt{\beta} \geq 1-\sqrt{\alpha \mu} \geq 1 - \frac{2\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} . \end{align} In this case, equality holds at the boundary, namely at $\alpha = \Bigl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \Bigr)^2$ and $\beta = (1-\sqrt{\alpha \mu})^2 = \Bigl( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Bigr)^2$, which turns out to be the optimal choice.
If $0 < \alpha \leq \Bigl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \Bigr)^2$ and $\beta < (1-\sqrt{\alpha \mu})^2$, then \begin{align} \max \{ r_1, r_n \} \geq r_n(\alpha,\beta) = \frac{1}{2} \Bigl( 1+\beta-\alpha \mu + \sqrt{(1+\beta - \alpha \mu)^2 - 4\beta} \Bigr) . \end{align} It can be verified that $r_n(\alpha,\beta)$ is a decreasing function of $\beta$. Hence, \begin{align} \max \{ r_1, r_n \} > r_n\bigl(\alpha,(1-\sqrt{\alpha \mu})^2\bigr) = 1-\sqrt{\alpha \mu} \geq \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} . \end{align}
If $\alpha > \Bigl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \Bigr)^2$ and $\beta \geq (\sqrt{\alpha L}-1)^2$, then \begin{align} \max \{ r_1, r_n \} = \sqrt{\beta} \geq \sqrt{\alpha L}-1 > \frac{2\sqrt{L}}{\sqrt{L}+\sqrt{\mu}}-1 = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} . \end{align}
If $\alpha > \Bigl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \Bigr)^2$ and $\beta < (\sqrt{\alpha L}-1)^2$, then \begin{align} \max \{ r_1, r_n \} \geq r_1(\alpha,\beta) = \frac{1}{2} \Bigl( \alpha L - 1 - \beta + \sqrt{(1+\beta - \alpha L)^2 - 4\beta} \Bigr) . \end{align} Similarly, $r_1(\alpha,\beta)$ is a decreasing function of $\beta$ and \begin{align} \max \{ r_1, r_n \} > r_1\bigl(\alpha,(\sqrt{\alpha L}-1)^2\bigr) = \sqrt{\alpha L}-1 > \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} . \end{align}
In summary, we have
\[\alpha_{opt}=\biggl( \frac{2}{\sqrt{L}+\sqrt{\mu}} \biggr)^2, \quad \beta_{opt} = \biggl( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \biggr)^2 \quad \text{and} \quad \rho_{opt}=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} . \tag{6} \label{equ:HB_optimal}\]With this optimal choice, the number of iterations to reach (relative) $\epsilon$-accuracy satisfies \begin{align} \rho_{opt}^k < \epsilon \Leftrightarrow k > \frac{\log (1/\epsilon)}{-\log \bigl(1 - \frac{2}{\sqrt{\kappa}+1} \bigr)} \approx \frac{\sqrt{\kappa}+1}{2} \log \frac{1}{\epsilon} . \end{align} In other words, the iteration complexity of HB with optimal step size is $O(\sqrt{\kappa} \log \frac{1}{\epsilon})$.
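The same kind of check for HB, again as a sketch with arbitrary test values of $\mu$, $L$, $n$: run the update with the parameters in (\ref{equ:HB_optimal}) and compare the observed per-step contraction with $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. (At this exact tuning the extreme modes have repeated roots, so the measured factor carries a small polynomial-in-$k$ correction.)

```python
import numpy as np

# Heavy-Ball on the diagonal model (2) with the tuning (6); mu, L, n are arbitrary test values.
mu, L, n = 1.0, 100.0, 50
rng = np.random.default_rng(0)
lam = np.concatenate(([L, mu], rng.uniform(mu, L, n - 2)))   # eigenvalues of Lambda

kappa = L / mu
alpha = (2.0 / (np.sqrt(L) + np.sqrt(mu))) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)            # predicted rate

x_prev = rng.standard_normal(n)
x = x_prev.copy()                                            # start with zero momentum
norms = []
for _ in range(200):
    # x_{k+1} = x_k - alpha * Lambda x_k + beta * (x_k - x_{k-1})
    x, x_prev = x - alpha * lam * x + beta * (x - x_prev), x
    norms.append(np.linalg.norm(x))

empirical = (norms[-1] / norms[99]) ** (1 / 100)             # average contraction per step
print(f"predicted rate {rho:.4f}, empirical rate {empirical:.4f}")
```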
3. Nesterov’s Accelerated Gradient
The HB update can be rewritten as
\[\begin{cases} \bm y^{(k+1)} = \bm x^{(k)} - \alpha \nabla f(\bm x^{(k)}) , \\ \bm x^{(k+1)} = \bm y^{(k+1)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) \end{cases} .\]While Polyak’s method evaluates the gradient before adding the momentum term, Nesterov suggested the other way around: apply the momentum first and evaluate the gradient at the resulting point,
\[\begin{cases} \bm y^{(k+1)} = \bm x^{(k)} - \alpha \nabla f(\bm x^{(k)}) , \\ \bm x^{(k+1)} = \bm y^{(k+1)} + \beta (\bm y^{(k+1)} - \bm y^{(k)}) \end{cases} .\]This simple change moves along a “lookahead” gradient that can correct for oscillations before they occur. Nesterov’s accelerated gradient update can be written in a single line (expressing the recursion in one sequence, which we again denote by $\bm x$) as
\[\bm x^{(k+1)} = \bm x^{(k)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) - \alpha \nabla f \bigl( \bm x^{(k)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) \bigr) .\]Substituting the gradient of $f$ in the quadratic case yields
\begin{align} \bm x^{(k+1)} &= \bm x^{(k)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) - \alpha \bm \Lambda \bigl( \bm x^{(k)} + \beta (\bm x^{(k)} - \bm x^{(k-1)}) \bigr) \\
&= (1+\beta) (\bm I_n - \alpha \bm \Lambda) \bm x^{(k)} - \beta (\bm I_n - \alpha \bm \Lambda) \bm x^{(k-1)} . \end{align}
Now one could attempt to derive an expression similar to (\ref{equ:HB}), but there is a better way to tackle the problem that gets rid of the $o(1)$ term in the convergence rate. Breaking the recursion down dimension by dimension, we obtain the second-order linear difference equation
\[x_j^{(k+1)} - (1+\beta)(1-\alpha \lambda_j) x_j^{(k)} + \beta (1-\alpha \lambda_j) x_j^{(k-1)} = 0 .\]Solving the characteristic equation
\[r^2 - (1+\beta)(1-\alpha \lambda_j) r + \beta (1-\alpha \lambda_j) = 0 , \quad \text{with} \quad \Delta_j = (1+\beta)^2 (1-\alpha \lambda_j)^2 - 4 \beta (1-\alpha \lambda_j) ,\]yields three cases:
If $\Delta_j = 0$, then there is a repeated real root $r_{12} = \frac{1}{2} (1+\beta)(1-\alpha \lambda_j)$ and \begin{align} x_j^{(k)} = (C_1 + C_2 k) r_{12}^k \quad \text{for some constants } C_1, C_2 \text{ depending on the initial iterates} . \end{align}
If $\Delta_j > 0$, then there are two distinct real roots $r_{12} = \frac{1}{2} \Bigl( (1+\beta)(1-\alpha \lambda_j) \pm \sqrt{\Delta_j} \Bigr)$ and \begin{align} x_j^{(k)} = C_1 r_{1}^k + C_2 r_{2}^k \quad \text{for some constants } C_1, C_2 \text{ depending on the initial iterates}. \end{align}
If $\Delta_j < 0$, then there are two conjugate complex roots satisfying $\abs{r_{12}} = \sqrt{\beta (1-\alpha \lambda_j)}$ and \begin{align} x_j^{(k)} = \Bigl( \sqrt{\beta (1-\alpha \lambda_j)} \Bigr)^k \bigl( C_1 \cos(\theta k) + C_2 \sin(\theta k) \bigr) , \end{align} for $\cos \theta = \frac{1+\beta}{2} \sqrt{\frac{1-\alpha \lambda_j}{\beta}}$ and some constants $C_1$, $C_2$ depending on the initial iterates; a quick numerical check of this oscillatory case is sketched below.
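As promised, here is a small sketch of the oscillatory case (assuming NumPy, with arbitrary values of $\alpha$, $\beta$, $\lambda_j$ chosen so that $\Delta_j < 0$), comparing the closed-form solution against direct iteration of the difference equation.

```python
import numpy as np

# Check the complex-root case (Delta_j < 0) of the difference equation
#   x_{k+1} - (1+beta)(1-alpha*lam) x_k + beta(1-alpha*lam) x_{k-1} = 0
# against the closed form rho^k (C1 cos(theta k) + C2 sin(theta k)).
# alpha, beta, lam are arbitrary values chosen so that Delta < 0.
alpha, beta, lam = 0.01, 0.8, 50.0
a = 1.0 - alpha * lam                          # here a = 0.5 > 0
delta = (1 + beta) ** 2 * a ** 2 - 4 * beta * a
assert delta < 0

rho = np.sqrt(beta * a)                        # modulus of the complex roots
theta = np.arccos((1 + beta) / 2 * np.sqrt(a / beta))

# Two initial values fix the constants C1, C2.
x0, x1 = 1.0, 0.3
C1 = x0
C2 = (x1 / rho - C1 * np.cos(theta)) / np.sin(theta)

# Direct iteration of the recursion.
xs = [x0, x1]
for k in range(2, 30):
    xs.append((1 + beta) * a * xs[-1] - beta * a * xs[-2])

# Closed-form solution at the same indices.
closed = [rho ** k * (C1 * np.cos(theta * k) + C2 * np.sin(theta * k)) for k in range(30)]
print(np.max(np.abs(np.array(xs) - np.array(closed))))   # agreement up to rounding error
```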
In all cases, we can bound the convergence rate by \begin{align} \rho_{\alpha,\beta} = \max_j r_j = \max \{ r_1, r_n \}, \quad \text{where } r_j = \begin{cases} \frac{1}{2} \Bigl( (1+\beta)\abs{1-\alpha \lambda_j} + \sqrt{\Delta_j} \Bigr) & \text{if } \Delta_j > 0, \\\ \sqrt{\beta \abs{1-\alpha \lambda_j}} & \text{otherwise.} \end{cases} \end{align} In order to converge, we need \begin{align} \rho_{\alpha,\beta} < 1 \quad \Leftrightarrow \quad \beta>0 \text{ and } \max \Bigl\{ 0, \frac{\beta-1}{\mu \beta} \Bigr\} < \alpha < \frac{2(\beta+1)}{L(2\beta+1)} . \tag{7} \label{equ:NAG_range} \end{align}
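Before carrying out the minimization analytically, one can locate the optimum numerically. The sketch below (assuming NumPy; $\mu$ and $L$ are arbitrary test values) evaluates the largest root modulus at the two extreme eigenvalues and scans a grid of $(\alpha, \beta)$ covering the range (\ref{equ:NAG_range}); its output should approximately match the closed-form choice derived next.

```python
import numpy as np

# Locate the NAG optimum numerically: compute the largest root modulus of
#   r^2 - (1+beta)(1-alpha*lam) r + beta(1-alpha*lam) = 0
# at the extreme eigenvalues and scan a grid of (alpha, beta) covering (7).
# mu and L are arbitrary test values.
mu, L = 1.0, 100.0

def nag_rate(alpha, beta, eigs=(mu, L)):
    worst = 0.0
    for lam in eigs:
        p = (1 + beta) * (1 - alpha * lam)          # sum of the two roots
        q = beta * (1 - alpha * lam)                # product of the two roots
        disc = complex(p ** 2 - 4 * q)              # may be negative
        r1, r2 = (p + np.sqrt(disc)) / 2, (p - np.sqrt(disc)) / 2
        worst = max(worst, abs(r1), abs(r2))
    return worst

alphas = np.linspace(1e-4, 2 / L, 200)              # beta = 0 gives the largest admissible alpha
betas = np.linspace(0.0, 0.999, 200)
best = min((nag_rate(a, b), a, b) for a in alphas for b in betas)
print(f"grid optimum: rate {best[0]:.4f} at alpha = {best[1]:.5f}, beta = {best[2]:.3f}")
# For comparison, the closed form below gives alpha ~ 0.0133, beta ~ 0.793, rate ~ 0.885.
```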
The optimal step sizes in this range can be chosen as follows. Let us first rewrite $\Delta_j$ as
\[\Delta_j = (1-\alpha \lambda_j)^2 \biggl( \beta - \frac{1-\sqrt{\alpha \lambda_j}}{1+\sqrt{\alpha \lambda_j}} \biggr) \biggl( \beta - \frac{1+\sqrt{\alpha \lambda_j}}{1-\sqrt{\alpha \lambda_j}} \biggr) .\]Thus, if $\alpha \lambda_j > 1$, $r_j=\frac{1}{2} \Bigl( (1+\beta)(\alpha \lambda_j-1) + \sqrt{\Delta_j} \Bigr)$ is an increasing function of $\beta$. Otherwise, $r_j(\beta)$ is decreasing in the interval $0<\beta<\frac{1-\sqrt{\alpha \lambda_j}}{1+\sqrt{\alpha \lambda_j}}$ and increasing in the interval $\beta>\frac{1-\sqrt{\alpha \lambda_j}}{1+\sqrt{\alpha \lambda_j}}$, with the minimum at $r_j\Bigl(\frac{1-\sqrt{\alpha \lambda_j}}{1+\sqrt{\alpha \lambda_j}}\Bigr) = 1-\sqrt{\alpha \lambda_j}$. Thus, we can break it into $3$ cases:
If $\alpha \leq \frac{1}{L}$, then \begin{align} \max \{ r_1, r_n \} \geq r_n \geq r_n \biggl( \frac{1-\sqrt{\alpha \mu}}{1+\sqrt{\alpha \mu}} \biggr) = 1-\sqrt{\alpha \mu} \geq 1-\sqrt{\frac{\mu}{L}} . \end{align} In this case, the equality holds at $\alpha = \frac{1}{L}$ and $\beta=\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$. However, the rate $\rho = 1-\sqrt{\frac{\mu}{L}}$ is suboptimal.
If $\frac{1}{L} < \alpha \leq \frac{1}{\mu}$, then we consider three sub-cases:
- If $\frac{1-\sqrt{\alpha \mu}}{1+\sqrt{\alpha \mu}} \leq \beta \leq \frac{1+\sqrt{\alpha \mu}}{1-\sqrt{\alpha \mu}}$, we have \begin{align} \max \{ r_1, r_n \} = \max \biggl\{ \frac{1}{2} \Bigl( (1+\beta)(\alpha L-1) + \sqrt{\Delta_1} \Bigr) , \sqrt{\beta(1-\alpha \mu)} \biggr\} . \end{align} Notice that both terms inside the $\max$ are increasing functions of $\beta$. However, the former is an increasing function of $\alpha$ while the latter is a decreasing function of $\alpha$. Thus, minimizing over $\alpha$ and $\beta$ in the aforementioned range yields the solution where the two terms are equal and $\beta$ is at its minimum value, i.e., \begin{align} \beta=\frac{1-\sqrt{\alpha \mu}}{1+\sqrt{\alpha \mu}} \quad \text{and} \quad \frac{1}{2} \Bigl( (1+\beta)(\alpha L-1) + \sqrt{\Delta_1} \Bigr) = \sqrt{\beta(1-\alpha \mu)} . \end{align} Solving this yields $\alpha = \frac{4}{3L+\mu}$ and $\beta=\frac{\sqrt{3L+\mu}-2\sqrt{\mu}}{\sqrt{3L+\mu}+2\sqrt{\mu}}$, with the optimal rate \begin{align} \rho = 1-\frac{2\sqrt{\mu}}{\sqrt{3L+\mu}} \leq 1-\sqrt{\frac{\mu}{L}} . \end{align}
- If $\beta < \frac{1-\sqrt{\alpha \mu}}{1+\sqrt{\alpha \mu}}$, we have \begin{align} \max \{ r_1, r_n \} = \max \biggl\{ \frac{1}{2} \Bigl( (1+\beta)(\alpha L-1) + \sqrt{\Delta_1} \Bigr) , \frac{1}{2} \Bigl( (1+\beta)(1-\alpha \mu) + \sqrt{\Delta_n} \Bigr) \biggr\} . \end{align} One would need to show that this maximum is always at least $1-\frac{2\sqrt{\mu}}{\sqrt{3L+\mu}}$, which does not seem straightforward to me, though!
- The case $\beta > \frac{1+\sqrt{\alpha \mu}}{1-\sqrt{\alpha \mu}}$ does not converge since \begin{align} \max \{ r_1, r_n \} \geq \sqrt{\beta (1-\alpha \mu)} > \sqrt{\frac{1+\sqrt{\alpha \mu}}{1-\sqrt{\alpha \mu}} (1-\alpha \mu)} = 1+\sqrt{\alpha \mu} > 1 . \end{align}
If $\alpha > \frac{1}{\mu}$, then \begin{align} \max \{ r_1, r_n \} = r_1 = \frac{1}{2} \Bigl( (1+\beta)(\alpha L-1) + \sqrt{\Delta_1} \Bigr) \geq \alpha L - 1 > \frac{L}{\mu} - 1 \geq 1-\frac{\mu}{L} \geq 1-\sqrt{\frac{\mu}{L}} . \end{align}
To conclude, the optimal choice of step sizes is given by
\[\alpha_{opt}=\frac{4}{3L+\mu}, \quad \beta_{opt}=\frac{\sqrt{3\kappa+1}-2}{\sqrt{3\kappa+1}+2}, \quad \rho_{opt}=1-\frac{2}{\sqrt{3\kappa+1}} . \tag{8} \label{equ:NAG_optimal}\]With this optimal choice, the number of iterations to reach (relative) $\epsilon$-accuracy satisfies \begin{align} \rho_{opt}^k < \epsilon \Leftrightarrow k > \frac{\log (1/\epsilon)}{-\log \bigl( 1-\frac{2}{\sqrt{3\kappa+1}} \bigr)} \approx \frac{\sqrt{3\kappa+1}}{2} \log \frac{1}{\epsilon} . \end{align} The iteration complexity of NAG with the optimal step sizes is therefore the same as that of HB, namely $O(\sqrt{\kappa} \log \frac{1}{\epsilon})$.
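Finally, the same kind of sanity check for NAG (a sketch assuming NumPy, with arbitrary test values of $\mu$, $L$, $n$): iterate the one-line update with the parameters in (\ref{equ:NAG_optimal}) and compare the observed contraction per step with $1-\frac{2}{\sqrt{3\kappa+1}}$; as with HB, a small polynomial-in-$k$ correction from the repeated root at $\lambda = \mu$ is expected.

```python
import numpy as np

# NAG on the diagonal model (2) with the tuning (8); mu, L, n are arbitrary test values.
mu, L, n = 1.0, 100.0, 50
rng = np.random.default_rng(0)
lam = np.concatenate(([L, mu], rng.uniform(mu, L, n - 2)))   # eigenvalues of Lambda

kappa = L / mu
alpha = 4.0 / (3 * L + mu)
beta = (np.sqrt(3 * kappa + 1) - 2) / (np.sqrt(3 * kappa + 1) + 2)
rho = 1 - 2 / np.sqrt(3 * kappa + 1)                         # predicted rate

x_prev = rng.standard_normal(n)
x = x_prev.copy()                                            # start with zero momentum
norms = []
for _ in range(200):
    y = x + beta * (x - x_prev)                              # lookahead (momentum) point
    x, x_prev = y - alpha * lam * y, x                       # gradient step at the lookahead point
    norms.append(np.linalg.norm(x))

empirical = (norms[-1] / norms[99]) ** (1 / 100)             # average contraction per step
print(f"predicted rate {rho:.4f}, empirical rate {empirical:.4f}")
```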
As a final note, HB is not guaranteed to converge when the objective is non-quadratic, even if it is strongly convex and smooth; a counterexample is given in [3]. On the other hand, NAG is guaranteed to converge on this function class as long as the step sizes are chosen properly (see [2]). In the next note, we will study the proof of convergence of NAG in the non-quadratic case.
References
1. B. T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17, 1964.
2. Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, 2004.
3. L. Lessard, B. Recht, and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1), 57–95, 2016.