2019-02-26

Momentum, Curveball, and Nesterov Acceleration


This post explores some relationships between the recent Curveball optimization method (https://arxiv.org/abs/1805.08095, more recent version at https://openreview.net/forum?id=Sygx4305KQ) and Nesterov momentum (as popularized by http://proceedings.mlr.press/v28/sutskever13.html).

Suppose we are iteratively optimizing a scalar-valued function $f(x)$ wrt $x$ by gradient descent. We wish to take a step $z$ that minimizes $f(x + z)$, whose Taylor series is given by
$$f(x + z) = f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z + \ldots$$
where $f'(x)$ denotes the gradient of $f$ evaluated at $x$, and similarly $f''(x)$ is the Hessian. Second-order methods locally model $f(x)$ by a second-order approximation $\hat{f}$:
$$\hat{f}(x + z) = f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z,$$
according to which the optimal step is
$$z^\star = \operatorname{argmin}_z \left[ f(x) + z^\top f'(x) + \frac{1}{2} z^\top f''(x) z \right] = -(f''(x))^{-1} f'(x).$$
However, solving this linear system (effectively multiplying by the inverse Hessian) is expensive when $x$ is high-dimensional.
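
To make the cost concrete, here is a minimal NumPy sketch of the exact Newton step on a toy quadratic (the quadratic, the `newton_step` helper, and the values are my own illustration, not from the paper); the `np.linalg.solve` call is the cubic-cost operation that second-order methods try to avoid at scale:

```python
import numpy as np

def newton_step(grad, hess):
    """Exact Newton step z* = -H^{-1} g, via a linear solve.

    Solving the system is O(d^3) in the parameter dimension d,
    which is what makes exact second-order steps expensive.
    """
    return -np.linalg.solve(hess, grad)

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so f'(x) = A x - b and f''(x) = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)

z = newton_step(A @ x - b, A)
print(x + z)  # a single step lands on the minimizer A^{-1} b
```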

The authors of Curveball propose instead to solve for $z^\star$ by gradient descent, and to interleave this inner optimization with the outer optimization over $x$:
\begin{align}
z &\leftarrow \rho z - \beta \frac{\partial \hat{f}(x + z)}{\partial z} \\
x &\leftarrow x + z
\end{align}
where $\rho, \beta$ are hyperparameters.
Now $\frac{\partial \hat{f}(x + z)}{\partial z} = f'(x) + f''(x) z$; if the approximation $\hat{f}$ were first-order instead, the $f''(x) z$ term would disappear and this would be equivalent to classical momentum.
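A minimal sketch of this interleaved update (my own rendering, not the paper's code; `grad_f`, `hvp_f`, and the hyperparameter values are placeholders). The point is that $f''(x) z$ is a Hessian-vector product, which autodiff frameworks can compute without ever materializing $f''(x)$:

```python
def curveball_step(x, z, grad_f, hvp_f, rho=0.9, beta=0.01):
    """One interleaved Curveball-style update (sketch).

    grad_f(x) returns f'(x); hvp_f(x, v) returns the Hessian-vector
    product f''(x) v. rho and beta are placeholder constants here;
    the paper tunes them automatically.
    """
    model_grad = grad_f(x) + hvp_f(x, z)  # gradient of the quadratic model wrt z
    z = rho * z - beta * model_grad       # one inner gradient step on z
    x = x + z                             # one outer step on x
    return x, z
```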

But if we're going to use gradients to optimize $z$, then why make the second-order approximation at all? Why not let $\hat{f} = f$ and optimize the step $z$ to minimize $f(x + z)$:
\begin{align}
z &\leftarrow \rho z - \beta \frac{\partial f(x + z)}{\partial z} \\
x &\leftarrow x + z
\end{align}
By the change of variables $y = x + z$, the derivative $\frac{\partial f(x + z)}{\partial z} = \frac{\partial f(y)}{\partial y} \frac{\partial y}{\partial z}$ is simply $f'(x + z)$, since $\frac{\partial y}{\partial z}$ is the identity. From this it is easy to see that this update rule is practically equivalent to the Nesterov momentum update given in Sutskever et al. 2013:
\begin{align}
z &\leftarrow \rho z - \beta f'(x + \rho z) \\
x &\leftarrow x + z.
\end{align}
The only difference is where the gradient is evaluated: Nesterov momentum weights the tentative step by the decay factor $\rho$, evaluating $f'$ at $x + \rho z$ rather than at $x + z$.
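
Side by side, the two updates differ only in the argument passed to the gradient (again a sketch with placeholder hyperparameters, not reference code):

```python
def full_function_step(x, z, grad_f, rho=0.9, beta=0.01):
    """The hat-f = f variant: gradient of f(x + z) with respect to z."""
    z = rho * z - beta * grad_f(x + z)
    return x + z, z

def nesterov_step(x, z, grad_f, rho=0.9, beta=0.01):
    """Nesterov momentum as in Sutskever et al. 2013:
    the gradient is evaluated at x + rho * z instead of x + z."""
    z = rho * z - beta * grad_f(x + rho * z)
    return x + z, z
```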

What does that mean? It means that if classical momentum is first-order and Curveball is second-order, well then Nesterov momentum is the infinite-order version of momentum. It's tempting to conclude that Curveball is a worse version of Nesterov momentum, as it makes an unnecessary second-order approximation. However, the Curveball paper goes on to adopt all the usual second-order tricks (a Gauss-Newton approximation to the Hessian, a trust region, etc.), so it's unclear how the final algorithm performs relative to plain Nesterov momentum.
