Passive-aggressive perceptrons and trust-region methods

Recently I've been reading up on some optimization things, and I think I saw an interesting connection between trust-region methods for nonlinear optimization (things like Levenberg-Marquadt) and the passive-aggressive (PA) family of online learning algorithms (as well as many proximal-gradient methods). To do so I think I'll first explain trust-region methods, then PA, and then the connection.

Trust-region methods, essentially, are a way of making sure that Newton's method works. In Newton's method you pick the function f you want to optimize, approximate it locally by a truncate taylor series, and minimize the approximation. So you're actually minimizing

Media_httpmathurlcom7_quket

where H is the hessian (the matrix of second derivatives) of f. This has an analytic minimum at

Media_httpmathurlcom6_dseqd

(you could invert the matrix but it's faster and arguably at least as stable to just solve the linear system above with least squares)

One problem with newton's method is that you're usually not solving a quadratic (otherwise f(x) = f'(x)), so there is an error in your taylor approximation (not only that, but in really bad cases your hessian can be not positive definite, which will lead to all sorts of weird behaviors when you try to solve the system above). There are, then, two strategies you can use to make it practical: line search (that is, trying many things in the diretion of H^-1 grad f and choosing the best), which is what is most commonly done, and trust-region methods, which consist of adding a diagonal term to the hessian to make the linear system "nicer" at the expense of a poorer approximation of f. So in a mile-high view all trust region methods would do as above using 

Media_httpmathurlcom7_nrjpg

instead of H.

Now to online learning. One problem with things like the perceptron is that they update willy-nilly, often taking unnecessarily big steps in the direction of something which was almost correctly classified and taking not nearly big enough steps in the direction of the actually hard examples. You can see the perceptron as doing SGD (technically, stochastic subgradient descent) in the following odd-looking loss function

Media_httpmathurlcom8_bqzgx

for y being either 1 or -1. The gradient of this is x, if it's misclassified, or 0, if it's correctly classified. The passive-aggressive family of online learning algorithm roughly tries to fix this eagerness in the perceptron by doing conservative updates. That is, rather than doing stochastic gradient steps to minimize the weird loss function above, take the smallest possible step needed to make it zero. Technically, it's the following optimization problem:

Media_httpmathurlcom7_fyoee

This problem is commonly solved by looking at its dual, which has an analytic solution. The lagrangian is

Media_httpmathurlcom7_hxllh

This is, of course, equivalent (in the sense that it leads to the same solution as)

Media_httpmathurlcom6_eyhag

and if you look at the Hessian of that, you'll see that all the first term does is add a multiple of the identity matrix to the Hessian of the original loss function (which doesn't even exist, as it's not quite differentiable). The way you end up tuning the alpha step size in PA is very similar to how you'd choose the lambda in a trust-region method, with the difference that you have a closed-form in PA which you don't in the trust-region method.

 

Hat tip to Michael Wick, with whom I discussed it this morning.

 

Algorithms versus ideas (or learning reductions versus perceptron updates)

One of the most entertaining parts of NIPS 2011 was sitting in Mark Reid's talk and the panel session in the Relations between machine learning problems workshop in Sierra Nevada. There was a long discussion on mapping the landscape of machine learning in a useful and compact way. This is clearly necessary, as there are currently hundreds of variations of even simple things such as binary classification that are only loosely related to each other. There are also grand unifying paradigms, such as structural risk minimization, neural networks, bayesian machine learning, etc, which tell incompatible stories as far as assumptions, inputs/outputs, and algorithms go, although some mapping between these approaches is possible (for example you can pick almost any neural network, tack a probability distribution on top of it and call it bayesian, or you can call your empirical risk a "likelihood" and your regularizer a "prior", and then trade days of computation time for more accuracy).

In the workshop I detected a bias towards the more formal, CS-y, kind of thinking, which is well expressed by the learning reductions formalism. In this paradigm you pick a simple, well-defined, baseline learning problem (such as binary classification with the 0-1 loss, but it could be really anything), and show how being able to build fairly accurate solutions to it allows you to build reasonably accurate solutions to more complex problems (and here SEARN works for structured learning, costing for cost-sensitive classification, etc), where "fairly" and "reasonably" are formally specified. This allows you to build a very CS-y map of the machine learning problems, the structures between them, and even how they relate to each other. In the future this might lead to interesting notions of completeness or hardness, etc.

In this spirit people at the workshop suggested that the way to go is formal specification of the interfaces between learning problems (going as far as suggesting learning-as-a-service), as this interface-specific knowledge would then lead to a better mapping of the relationships, and a better usage of reductions.

I'm not too happy with this view, however, for two reasons: first, because learning problems are social constructs, and simplifications of real-world needs, and secondly because this is not the pattern of reuse which has naturally developped, and it would be smart to understand this pattern before changing it.

For my first point I'd like to say that I've never seen a binary classification problem. What I've seen is many real-world, messy problems, mapped into the "binary classification" box so that they can be solved to a reasonable degree of accuracy. There is already a lot of heavy mapping going on between the needs of a practicioner and the abstraction of binary classification: you need to come up with some notion of instances which are reasonably independent, a labeled set of examples (and with these a labeling scheme) to train and test on, heuristic feature functions which are almost but not quite equivalent to the label, and finally a loss function that allows you to tell how right/wrong you are (which depends heavily on the labellings, which almost never occur naturally).

So all these issues are rightly swept under the rug when talking about building classifiers, because over time it was discovered that for a lot of problems there exist ways of making these mapping decisions which lead to useful results in the end, at a reasonable cost. At the same time, these mappings create binary classification problems with the same interface that are so different under the hood that entirely different abstractions are often used to attack them (think feature learning/engineering, online things, generative assumptions, etc).

My second point is, I think, even more important. If you look at the way machine learning has developed historically, researchers almost never trusted each other's results as black boxes to which you can plug things in the input and output slots. Completely opposite to the learning reductions approach is how perceptron updates came to be in the center of way too many facets of machine learning. Instead of looking for clear, well-defined, input-output black boxes, machine learning researchers have pried open problems looking for a place where you can pick your parameter vector and add it to it something easily computed from the training example and the true label in a way that your parameter vector has a higher dot product with the true label than with the wrong labels. This simple, ugly, idea, which depends on all sorts of artificial notions such as parameter vectors, true labels, training examples, etc, can actually be found lying around inside most machine learning algorithms, at least the efficient ones. It can be used from binary classification to machine translation, and pretty much every problem in between. The open-box idea of perceptron updates, or more generally gradient/mirror descent in generalized linear models (the fancy way of calling it), has managed to be more useful than any black-box abstraction as far as machine learning goes. This is, of course, not at all unique to the perceptron update, and I could very well say something like belief propagation (which will then encompass most message-passing and some non-message-passing algorithms used in ML, with a loose enough definition of algorithm).

So, while thinking in terms of nice taxonomies, clean interfaces, reductions, and complexity is very appealing, is it really the right thing to do? Can a learning reduction make any black-box algorithm as useful as the idea of a perceptron update?

Finally, should we be really looking at reusing black-box algorithms, or should we try to surface as clearly as possibly the white-box ideas which can later be reused in countless other algorithms?

I think rather than attempting to reconceive the world in our image it often turns out that the best way to move forward is to do what already works in a better way.

(of course I don't find reductions a bad idea at all. It is brilliant how some complex problems can be solved using them, and I'd like to work on them at some point. It's just that maybe a black box view of the field is not ideal)

#NIPS2011 Workshop day 1 Highlights

Today was the first day of the NIPS worksops, and as usual it was really interesting.

In the morning I mostly stayed at the Bayesian Optimization workshop, which was really interesting. Remy Munos talked about his ideas on optimizing highly nonconvex and nonsmooth functions with stochastic bandit feedback. The idea involves using a given hierarchical partition of the space to do some clever sampling, plus slight assumptions on smoothness of the function only around the optimum. Even if you don't know the actual smoothness around the optimum you can still get some pretty interesting results.

I found out about research that is very closely related to my recommendation systems work, by Roman Garnett and Katja Hofmann. I'm really looking forward to reading their papers on the subject, and possibly collaborating in these areas.

There was also Philip Hennig's poster about doing bayesian optimization where you try to keep track of explicit beliefs about where the optimum is and what is its value, rather than keeping track of beliefs about the function value everywhere. Pieter Abbeel's poster about safe exploration was really interesting. Apparently he has a generalization of minimax and expectimax objectives where you can tune one simple intuitive parameter to control how adversarial the world is, and route around that.

After this I watched the contributed talks at the Computational Tradeoffs in Learning workshop. Sameer Singh talked about his approximate parallel BP algorithm, which is really interesting. Suppose you run BP, and in doing so find out that the marginal on some nodes isn't changing all that much. Then you can fix the distribution of those nodes, condition on them, and decompose your graphical model into many independent parts, on which you can then run inference in parallel. Apparently at 95% accuracy or so you can get almost linear speedup with up to eight cores or so on some realistic models.

Then there was Hal Daumé's really interesting talk on learning time vs accuracy tradeoffs in complex inference problems. The setting is great, and I have high hopes for this line of research. Essentially, you have these nondeterministic inference problems, such as parsing, in which you can push and pop items into a chart in pretty much any order you want, and if you're willing to wait forever you will find the optimal value. His objective then is to learn a heuristic function used to sort items popped off of the agenda to minimize time at a small expense in accuracy. The problem ends up being so nondeterministic, however, that it's really hard to come up with reasonable gradients that lead you towards good solutions, so there is a lot of future work needed to make this approach feasible. Regardless, it's something I wish more inference techniques tried to do, and even if this fails some other approach should be tried.

Following that there was Jason Eisner's student Vaselin's talk on something pretty similar, but different. He is following up on his earlier work where you essentially treat your model, inference, and loss function computation as a big black box, which you optimize on your training data using gradients computed by automatic differentiation. This has all sorts of nice consequences, such as not needing to have the artificial distinction between model and inference (which gets really dodgy with approximate inference), getting a nice thing to optimize that leverages properties from your inference, and being able to use literally any structure you want. The problem, unfortunately, ends up highly non-convex (which makes sense). This talk was about extending this idea to account for explicit computational cost tradeoffs by modeling early stopping and factor selection as part of the cost function you can optimize over. It's pretty impressive, and the results are great.

In the afternoon I mostly stuck around the Relationships Between Machine Learning Problems workshop, which I generally found really interesting. I think machine learning as a field has solidified around some relatively suboptimal designs for problems and solutions, and exploring the relationships between these is a way of breaking out of this cycle. Mark Reid's talk on possible anatomies for a learning problem was really interesting, not so much for the solutions it proposed (which were arguable and hence endlessly discussed in the panel sessions) but for making a broad sweep of the usual current ways of talking about machine learning problems. Also, for exposing me to the term "nips bingo", which is writing papers by essentially stringing random buzzwords together (for an innocuous example, think of something like "A copula-based manifold optimization process for for semi-supervised identification of latent tree models"). The panel discussion was really enlightening, as was the follow-up discussion I had with Tiberio Caetano about some issues regarding current machine learning.

All in all this was really interesting, and I'm looking forward for tomorrow's sessions. I'll have a poster at the Challenges in Learning Hierarchical Models workshop, and I'm also planning on seeing some talks in the Philosphy and Machine Learning, Computational Social Science, Learning Semantics, and Domain Adaptation workshops. It'll be a lot of running around.

#NIPS2011 Day 4 Highlights & Workshop plans

Today was only a half day at NIPS, so no posters and very few talks.

There were two very interesting talks on learning large-scale mixture like things with very clever subsampling of the data. The first, Scalable Training of Mixture Models via Coresets, by Feldman et al, is explicitly about something I find really cool and interesting, which is coresets. It turns out that it's surprisingly difficult to extend the known coresets for the k-means problem into coresets for mixtures of gaussians, mostly because of the covariance. Once they managed to get that, however, they got really nice results on streaming and distributed algorithms for learning these coresets. The motivation is that you want to discard points more often in dense regions and less often in sparse regions, while at the same time inversely weighting the remaining points by their probability of being discarded. This is done in a higher-dimensional space that somehow helps encode covariance information. Then they can cheaply merge these coresets into larger coresets that represent the data more accurately. It's really impressive how well their algorithm degrades when you tighten the data budget (essentially not at all).

The other talk was Fast and Accurate k-means For Large Datasets, by Schindler et al. They want to approach a much easier problem, k-means, but the mechanics of the technique is eerily similar. The algorithm has a two-phase structure: first, one streaming pass over the data discards all points sufficiently close to known points (probably shifting the weight of the known points) to get a smartly subsampled dataset; then, a second pass goes over these weighted remaining points and computed the solution to a k-means problem. The first phase caused some people to misunderstanding it as approximating k-medians or some other aglomerative clustering technique, but it's really finding a coreset structure in a streaming setting and then using it to learn k-means. The highlight was the surprisingly cheeky method they used to find the nearest neighbor of a newly arrived streaming point: random projection (note the singular). They just projected all points in the coreset into one random direction and did a binary search in this direction to find some candidate points closest to the test point to avoid computing all n*k distances, which makes their algorithm faster than computing the quality of the k-means approximation.

Tomorrow is the first day of the workshops, which are my favorite part of nips, because the talks tend to be more interesting and the posters more creative (you can clearly tell that I don't ski). I'm planning on attending my talk in the Bayesian Optimization workshop at 8h40 (at Hotel Bar in Melia Sierra Nevada), Jenn Wortman Vaughn's talk at the relationships between machine learning problems workshop (at Dilar, Melia Sierra Nevada), the contributed short talks at the computational tradeoffs workshop (at Montebajo: Basketball court), and the afternoon talks of Peter Grunwald, Mark Reid, and Dario Garcia at the relations between machine learning problems workshop, again.

#NIPS2011 Day 3 Highlights

Today was, I think, the most interesting day in NIPS by far.

It started really well with Bernard Schölkopf's talk about causality. He talked mostly about his recent research (and not other also interesting directions) on identifying causal-like relationships without interventions. The key idea is that with some somewhat defensible modeling assumptions, like additive heteroskedatic noise (that is, caused variables are generated from their causes by computing a deterministic nonlinear function and adding noise with a fixed variance) or the fact that P(cause) is independent of the function f(cause) that generates the caused variable you can decide which variables cause which, or, with some kernel-based techniques, whether there is a confounding hidden variable.

Of course, what this is really saying is that these seemingly innocuous assumptions are actually very limiting. There are, of course, many philosophical issues with looking at causality in graphical models, many obvious ones (are there causes?, as someone asked) and many non-obvious ones (most actual data is on very aggregate variables instead of individual events, and it is really arguable whether changing the rate of X can really cause the rate of Y to change, as causality is a relationship between local events and not aggregate statistics), but this whole direction is really interesting.

There was also an interesting spotlight on Object Detection with Grammar Models by Girshick et al, but I didn't quite catch enough detail to explain it.

The talk on Efficient inference in CRFs with fully connected Gaussian potentials by Koltun et al was also really interesting. It turns out that the main message-passing step of a variational approximation of this kind of CRF with gaussian random variables is a convolution with a high-dimensional Gaussian filter, and there are very efficient techniques to do that which can avoid the obvious n^2 step and make using these models feasible where otherwise they wouldn't be.

The paper on Practical Variational Inference for Neural Networks by Alex Graves also turned out to be really interesting. The model is essentially a batch version of confidence-weighted learning, and he managed to come up with enough tricks to make it really fast.

Jake Abernathy's talk on A Collaborative Mechanism for Crowdsourcing Prediction Problems was the best NIPS talk so far. Really nice motivation, and a very interesting step towards a solution to this problem. There are some really hard trade-offs here: you want to design incentives so that people who have information feel compelled to share it as soon as possible but at the same time you don't want payoffs to asymptote too quickly or there is no incentive to shave off the last couple of percent points, which are in practice much more expensive (exponentially so) than the first percentage points. There is also the issue of generalization: in practice these competitions degenerate into a race between ever-more-complex ensembles, which really hurts whoever wants to build a real system out of the results of these prediction competitions. Still, I really hope this area gets more researched, as I think it is really important. 

Now it's time to get ready for the workshops (my talk is 8h40 AM on the Bayesian optimization workshop, on friday, and there are some really interesting talks in the Computational tradeoffs, Relationships between learning problems, and Learning semantics workshops).

#NIPS2011 Day 2 Highlights

Yesterday most of the talks were about reinforcement learning or neuroscience, so they didn't quite catch my attention. Thankfully the organization moved the posters a bit further apart, which helped a lot. Also, most of the posters were papers I had already read, except for one (which wasn't available when I made my reading lists):

Message-Passing for Approximate MAP Inference with Latent Variables, by Jiang, Rai, and Daume III. They present another, simple, generalization of loopy belief propagation for the case of marginal MAP inference. The algorithm is really simple: nodes which you want to marginalize away send sum-product messages and nodes you want to maximize on send max-product messages. One thing which slightly confused me is that in their experiments max-product was always better than sum-product for the case of marginal map inference, but as far as I know most people rarely use max-product BP, instead doing sum-product and looking at max-marginals (or max-consistent marginals in the case of parsing or something similar), as it tends to converge more often and find better fixed points. Maybe this is related to the Hamming loss-like evaluation function, however. Still, this is something I see myself using in the future, as in practice all cases of inference in nontrivial graphical models involve some variables you added just to manage uncertainty and some variables you really want to be as accurate as possible on.

The talk by Po-Ling Loh on guarantees even in the case of non-convexity was also pretty interesting.

One trend I noticed yesterday as well was more than one paper on learning latent tree structures. I think I'll check this area out more closely after the conference.

#NIPS2011 Day 1 Highlights

The tutorials this year struck me as rather weak, but perhaps this is observer bias. First, it's odd that two tutorials were picked on what are very closely related topics (LP relaxations for inference and dual decomposition and lagrangian relaxations, instances of LP relaxations, for inference in NLP). Of these I liked the NLP tutorial more, perhaps because of the simultaneous focus on practical things and simple explanations of the theorems behind them.

The nonparametric bayesian tutorial was interesting, but didn't have much in the way of cool practical advice or deep new insights (rather, it helped shape some ideas I already had). Parts of the presentation were shaky as well. For example, they argue that in Bayesian inference the posterior always (except for some corner cases) converges to whatever generated the data. This is called Dobb's theorem. I wonder, however, how is it possible to apply this in non-identifiable settings, or, more generally, in settings with latent variables where the likelihood biases what kind of information about the latent variables can be recovered (for example, some bits of the latent variables might never be used to generate the observable data). If this theorem applies only in scenarios where there are no latent variables then it is really uninteresting.

Apparently they've managed to cut away all the funny bits of NIPS, making a rather boring introductory talk and completely removing the workshop skits.

Of the papers, the layout of the poster session was really bad, crowding over a thousand people in an unnecessarily small space, and grouping posters in sets of three, which had the unfortunate effect that middle-of-the-set presenters always ended up standing right in front of their posters.

My reading list was posted before, but as usual some papers of the day caught my attention.

Collective Graphical Models by Sheldon and Dietterich proposes and solves a rather neat problem where observables in a graphical model are really averages over large sets of replicas of the model. This arises all the time when, to borrow their example, fairly thorough samples are made of sequential processes at given time stamps which reveal how many things are in which state at any time but crucially not which of these things are in which state. This individual information is filled in with something EM-like.

Multiview Learning of Word Embeddings via CCA by Dhillon, Foster, and Ungar tries to learn things which loosely resemble brown clusters (and neural word embeddings) but which, crucially, are (1) token-specific, rather than type-specific (although it's not clear how they manage this) and (2) very fast by a linear model which is essentially CCA, so it's easy to implement and understand. Instead of grouping the entire context of a token in a bag, or just looking at words before/after, they use CCA to find the maximally correlated projection for each word given both its before- and after-context. They manage to get new state-of-the-art results on CONLL NER, which I find surprising as I had already written off data dataset as being too overfit already, beating all other approaches I know of, even without doing severe feature engineering (which the other approaches famously do).

All in all this was really interesting, and I'm looking forward to the next days.

The value of information in recommendation systems

If you look the right way at a recommendation system, in practice, you'll see a machine learning system that selects a few unlabeled examples, shows them to a user, eventually gets labels for some of those, and retrains itself to incorporate this new information. So effectively the act of making a recommendation is the only way such systems have to get information about the users, specially as in most user interfaces search functionality is deemphasized in relation to recommender functionality. These systems are effectively selecting their own training data.

Also, one of the motivations behind recommendation systems is that people on websites will generally search for similar things, so you might be able to show them what they would search for before they do it. However, if you assume, like it's usually done, that your objective is predicting some behavioral data that you have collected, there's a problem. Suppose I can give you a perfect recommender system, no errors, perfect precision and perfect recall. It will, then, necessarily, recommend to each user those things he was going to search for anyway and nothing else. This is obviously bad from a cost perspective: you spent a lot of cash hiring grumpy machine learners and now your revenue stayed the same. This suggests that the actual quality of a recommendation system is counterfactual: it's how much your actual revenue will increase if you use it, compared to what its value will be if you don't use it.

Finally there's another tricky issue. Let's say we're building a movie recommendation system. Suppose every day I search for silly action movies (Rambo 17, or whatever) and click on "one star". In most data driven settings you'll only observe that I am not liking the rambo movies - essentialy, that my probability of liking a rambo movie given that I wanted to watch it is small. However, the probability of me wanting to watch it is pretty high. Should a recommender system show me rambo movies? Or, more generally, what can you do when the data you collect (in this case P(rating)) is actually conditioned on one of the variables you want to predict (in this case P(want to watch)?

To start to tackle all these issues I'm presenting a paper from my internship work with Jurgen van Gael, Ralf Herbrich, Ulrich Paquet, and others, at Microsoft Research Cambridge, in the Bayesian Optimization, Experiment Design, and Bandits workshop.

The idea behind the paper is to treat the value of a recommender system as the quality of the recommendations it actually makes (not some abstract notion of matching the rank of user preferences or, god forbid, mean squared error on star ratings). If you assume the users are not adversarial (i.e., a poser, or someone who clicks "like" not because he likes something but to deceive your system) and their actions can be well-modeled by a probability distribution on which you have priors (maybe computed from a logged data set), you can write down the expected change in total value of this recommendation session as a function of which recommendations the system makes to the user (because then there is a probability of the user rating those items, and the system getting more information, and making better recommendations).

This provides natural solutions for many classic problems in recommendation systems, such as diversity (you now want to make diverse recommendations to maximize the information you'll get from the user), over-personalization (as long as there is uncertainty in the system, and you should probably use a tracking model anyway, the system will keep risking making bad recommendations on the off chance that the user might like something different), and cold-start (while you can't predict the tastes of a user you have never seen it is possible to make good recommendations by hedging your bets). We even have experimental data with real users showing that this in fact works.

In one sentence, my point is that every recommendation system is in practice an active learning agent, and a pretty naive one, so it should be very easy to do better and hedge your bets.

If you're interested, or if you think I'm wrong, I'll give a talk December 16th, 8h40 AM, on the Bayesian Optimization, Experimental Design, and Bandits NIPS workshop. The paper is also online, and I'd love to get some comments. If you can't make it to the talk I'll also be around in the poster sessions in that workshop.

Correlations and anticorrelations in LDA inference

This year I'll present two papers in NIPS workshops. The first is about something I think is an often overlooked issue in topic models, specially more complicated ones. This is all obviously my own personal views, and do not reflect those of my co-authors (Hanna and Andrew), although hopefully they'll not disagree too much with me.

The paper is called Correlations and Anticorrelations in LDA inference, and it is based on the following insight. In LDA the only reason you can learn crisp, semantic topics at all is that the prior has a very strong preference for using as few as possible topics in any given document. Not only that, but given a choice about whether to split a certain number of word tokens between two topics or put them all in one topic the solution where they are all in one topic always has higher likelihood. So what happens is that if each word in a document is really only explainable by a single topic nothing bad happens. Whenever a token is explainable by more than one topic, LDA inference will pick the topic for that document which more easily explain some other words in that document. Iterating this process,then, will make the topics "move apart" in space so that each topic ends up responsible for as few words as possible.

However, it's usually common in the final learned topics for some word types to appear with very high probability in more than one topic. If you try, then, to infer the document-specific distributions over topics for new documents, whenever that word happens to occur in a document the inference process will effectively pick one of the two possible explanatory topics. This means that those topics which share words, even if the prior distribution is completely indifferent to their co-occurrences, will co-occur less often than expected on the posterior, and much less often than those topics which do not share words.

So if you try to learn some structure such as correlations over the topics what happens is that you'll probably also learn a lot of spurious correlations induced not by the actual information in the topics but by which topics share how many words. A nice thing of this is that it can explain why things such as correlated LDA have a higher test-set likelihood even though they do not produce better topics: because this bias is the same in the training and test sets the model will accurately learn to predict topic covariances on the test data even though those covariances are artificial.

If you want to know more about this, or tell me what is wrong with this, I'll be happy to talk with you in the poster sessions in the Challenges in Learning Hierarchical Models: Transfer Learning and Optimization workshop, December 17th, in "Montebajo: Library" (whatever that means). These sessions will be 9h50 AM and 7h10 PM.

 

Bayesian optimization

Looking at the NIPS papers (and workshops) I noticed many references (some hidden) to Bayesian optimization as a useful subroutine, so I decided to investigate and see what it's all about. I found a great tutorial by Eric Brochu, Vlad Cora, and Nando de Freitas, and I decided to follow it and see what it's all about. Note that this is one of the techniques used in James Bergstra's paper on hyperparameter optimization, and he has optimized and tested code online to do this kind of optimization, so you should use that code rather than the one here.

Regardless, I thought it to be a useful exercise to implement it and see what happens.

Bayesian optimization is a really cool and simple way of globally optimizing a function which is very expensive to evaluate. So expensive, in fact, that you're not bothered by the fact that it involves optimizing over a gaussian process with something like simulated annealing (or really whatever gradient-free global optimization algorithm strikes your fancy). One example of this is hyperparameter optimization, where your neural network has over 10 different tuning parameters which interact in odd ways and essentially make your research irreproducible. Then, you really care about not running too many iterations of something silly such as simulated annealing over your real objective function and are willing to spend the computational budget of a few minutes to get a really good point to try out.

So how does it work? First, you assume you have evaluated your objective function on a bunch of arbitrarily chosen points of your space. Then, in each iteration, it fits a gaussian process over your function evaluations and searches over this gaussian process for the best point to try out next. This search is a global optimization of something called an acquisition function, which can be as simple as the expected value of your objective at the point plus the predicted variance at that same point (the UCB acquisition function) or the expected improvement evaluation your function at that point can have over your current best point. Then, you select your current best point, evaluate your expensive objective function in it and add it to the list. Repeat until you're satisfied that your objective value is no longer improving all that much. 

Bayesian optimization also makes sense in an intuitive way: essentially you're replacing a worst-case assumption (for example, doing grid search and hoping to find optimal points assumes that your function is Lispchitz with constant related to the size of the grid) on your function with an average-case assumption expressed as any kernel you want. Effectively instead of saying that (f(x')-f(x))/||x'-x|| is always smaller than K you say that f(x')-f(x) ~ Normal(0, k||x'-x||), which then allows you to fiddle with the norm (or even replace it with something entirely different), and also generalizes nicely to discrete or mixed discrete-continuous spaces (as long as you're still able to do a reasonable job of searching over your acquisition function).

Edit: Ruben Cantin has an efficient multithreaded C++ implementation with python bindings.