Passive-aggressive perceptrons and trust-region methods
Recently I've been reading up on some optimization things, and I think I saw an interesting connection between trust-region methods for nonlinear optimization (things like Levenberg-Marquadt) and the passive-aggressive (PA) family of online learning algorithms (as well as many proximal-gradient methods). To do so I think I'll first explain trust-region methods, then PA, and then the connection.
Trust-region methods, essentially, are a way of making sure that Newton's method works. In Newton's method you pick the function f you want to optimize, approximate it locally by a truncate taylor series, and minimize the approximation. So you're actually minimizing
where H is the hessian (the matrix of second derivatives) of f. This has an analytic minimum at
(you could invert the matrix but it's faster and arguably at least as stable to just solve the linear system above with least squares)
One problem with newton's method is that you're usually not solving a quadratic (otherwise f(x) = f'(x)), so there is an error in your taylor approximation (not only that, but in really bad cases your hessian can be not positive definite, which will lead to all sorts of weird behaviors when you try to solve the system above). There are, then, two strategies you can use to make it practical: line search (that is, trying many things in the diretion of H^-1 grad f and choosing the best), which is what is most commonly done, and trust-region methods, which consist of adding a diagonal term to the hessian to make the linear system "nicer" at the expense of a poorer approximation of f. So in a mile-high view all trust region methods would do as above using
instead of H.
Now to online learning. One problem with things like the perceptron is that they update willy-nilly, often taking unnecessarily big steps in the direction of something which was almost correctly classified and taking not nearly big enough steps in the direction of the actually hard examples. You can see the perceptron as doing SGD (technically, stochastic subgradient descent) in the following odd-looking loss function
for y being either 1 or -1. The gradient of this is x, if it's misclassified, or 0, if it's correctly classified. The passive-aggressive family of online learning algorithm roughly tries to fix this eagerness in the perceptron by doing conservative updates. That is, rather than doing stochastic gradient steps to minimize the weird loss function above, take the smallest possible step needed to make it zero. Technically, it's the following optimization problem:
This problem is commonly solved by looking at its dual, which has an analytic solution. The lagrangian is
This is, of course, equivalent (in the sense that it leads to the same solution as)
and if you look at the Hessian of that, you'll see that all the first term does is add a multiple of the identity matrix to the Hessian of the original loss function (which doesn't even exist, as it's not quite differentiable). The way you end up tuning the alpha step size in PA is very similar to how you'd choose the lambda in a trust-region method, with the difference that you have a closed-form in PA which you don't in the trust-region method.
Hat tip to Michael Wick, with whom I discussed it this morning.