The NIPS Experiment

TL;DR: Half the papers appearing at NIPS would be rejected if the review process were rerun.


[This is a guest post by Eric Price.]

I was at NIPS in Montreal last week, which has a really advanced format relative to the theory conferences. The double blind reviewing, rebuttal phase, and poster+lightning talk system all seem like improvements on the standard in TCS, and having 2400 attendees is impressive (and overwhelming). But the most amazing thing about the conference organization this year was the NIPS consistency experiment.

A perennial question for academics is how accurate the conference review and acceptance process is. Getting papers into top conferences is hugely important for our careers, yet we all have papers rejected that we think should have gotten in. One of my papers was rejected three times before getting into SODA — as the best student paper. After rejections, we console ourselves that the reviewing process is random; yet we take acceptances as confirmation that our papers are good. So just how random is the reviewing process?  The NIPS organizers decided to find out.

The NIPS Experiment

The NIPS consistency experiment was an amazing, courageous move by the organizers this year to quantify the randomness in the review process. They split the program committee down the middle, effectively forming two independent program committees. Most submitted papers were assigned to a single side, but 10% of submissions (166) were reviewed by both halves of the committee. This let them observe how consistent the two committees were on which papers to accept.  (For fairness, they ultimately accepted any paper that was accepted by either committee.)

The results were revealed this week: of the 166 papers, the two committees disagreed on the fates of 25.9% of them: 43. [Update: the original post said 42 here, but I misremembered.] But this “25%” number is misleading, and most people I’ve talked to have misunderstood it: it actually means that the two committees disagreed more than they agreed on which papers to accept. Let me explain.

The two committees were each tasked with a 22.5% acceptance rate. This would mean choosing about 37 or 38 of the 166 papers to accept1. Since they disagreed on 43 papers total, this means one committee accepted 21 papers that the other committee rejected and the other committee accepted 22 papers the first rejected, for 21 + 22 = 43 total papers with different outcomes. Since they accepted 37 or 38 papers, this means they disagreed on 21/37 or 22/38 ≈ 57% of the list of accepted papers.

In particular, about 57% of the papers accepted by the first committee were rejected by the second one and vice versa. In other words, most papers at NIPS would be rejected if one reran the conference review process (with a 95% confidence interval of 40-75%):

  nips-pie1Most papers accepted by one committee were rejected by the other, and vice versa.

This result was surprisingly large to most people I’ve talked to; they generally expected something like 30% instead of 57%. Relative to what people expected, 57% is actually closer to a purely random committee, which would only disagree on 77.5% of the accepted papers on average:

randomgraph2If the committees were purely random, at a 22.5% acceptance rate they would disagree on 77.5% of their acceptance lists on average.

In the next section, I’ll discuss a couple simple models for the conference review process that give the observed level of randomness.

Models for conference acceptance

One rough model for paper acceptance, consistent with the experiment, is as follows:

  1. Half the submissions are found to be poor and reliably rejected.
  2. The other half are accepted based on an unbiased coin flip.

This might be a decent rule of thumb, but it’s clearly missing something: some really good papers do have a chance of acceptance larger than one half.

The “messy middle” model

One simple extension to the above model is the “messy middle” model, where some papers are clear accepts; some papers are clear rejects; and the papers in the middle are largely random.   We can compute what kinds of parameters are consistent with the NIPS experiment.  Options include:

  1. The above model. Half the papers are clear rejects, and everything
    else is random.
  2. The opposite. 7% of all papers (i.e. 30% of accepted papers) are
    clear accepts, and the  other 93% are random.
  3. Somewhere in the middle. For example 6% of all papers (i.e. 25% of accepted papers) are clear accepts, 25% of submitted papers are clear rejects, and the rest are random.messymiddle

The “noisy scoring” model

As I was discussing this over dinner, Jacob Abernethy proposed a “noisy scoring” model based on his experience as an area chair. Each paper typically gets three reviews, each giving a score on 0-10. The committee uses the average score2 as the main signal for paper quality. As I understand it, the basic committee process was that almost everything above 6.5 was accepted, almost everything below 6 was rejected, and the committee mainly debated the papers in between.

A basic simplified model of this would be as follows. Each paper has a “true” score {v} drawn from some distribution (say, {N(0, \sigma_{between}^2)}), and the the reviews for the paper are drawn from {N(v, \sigma_{within}^2)}. Then the NIPS experiment’s result (number of papers in which the two committees disagree) is a function of the ratio {\sigma_{between}/\sigma_{within}}. We find that the observation would be consistent with this model if {\sigma_{within}} is between one and four times {\sigma_{between}}:


Once the NIPS review data is released, we can check the empirical {\sigma_{within}} and {\sigma_{between}} to see if this model is reasonable.

One nice thing about the noisy scoring model is that you don’t actually need to run the NIPS experiment to estimate the parameters. Every CS conference could measure the within-paper and between-paper variance in reviewer scores. This lets you measure the expected randomness in the results of the process, assuming the model holds.


Computer science conference acceptances seem to be more random than we had previously realized. This suggests that we should rethink the importance we give to them in terms of the job search, tenure process, etc.

I’ll close with a few final thoughts:

  • Consistency is not the only goal. Double-blind reviewing probably decreases consistency by decreasing the bias towards established researchers, but this is a good thing and the TCS conferences should adopt the system.
  • Experiments are good! As scientists, we ought to do more experiments on our processes. The grad school admissions process seems like a good target for this, for example.
  • I’d like to give a huge shout-out to the NIPS organizers, Corinna Cortes and Neil Lawrence, for running this experiment. It wasn’t an easy task — not only did they review 10% more papers than necessary, they also had the overhead of finding and running two independent PCs. But the results are valuable for the whole computer science community.


  1. The committees did not know which of the ~900 papers they were reviewing were the 166 duplicated ones, so there can be some variation in how many papers to accept, but this is a minor effect.
  2. They also use a “confidence-weighted” score, but let’s ignore that detail.

Robustness versus Acceleration

My blog post on the Zen of Gradient Descent hit the front page of Hacker News the other day. I don’t know how that happened. It got me more views in one day than this most humble blog usually gets in half a year. I thought I should take this as an excuse to extend the post a bit by elaborating on one remark I made only in passing. You don’t need to go back to reading that post unless you want to. This one will be self contained.

The point I made is that basic Gradient Descent (GD) is noise tolerant in a way that Accelerated Gradient Descent (AGD) is not. That is to say, if we don’t have exact but rather approximate gradient information, GD might very well outperform AGD even though its convergence rate is worse in the exact setting. The truth is I was sort of bluffing. I didn’t actually have a proof of a formal statement that would nail down this point in a compelling way. It was more of a gut feeling based on some simple observations.

To break the suspense, I still haven’t proved the statement I vaguely thought was true back then, but fortunately somebody else had already done that. This is a thought provoking paper by Devolder, Glineur and Nesterov (DGN). Thanks to Cristobal Guzman for pointing me to this paper. Roughly, what they show is that any method that converges faster than the basic gradient descent method must accumulate errors linearly with the number of iterations. Hence, in various noisy settings auch as are common in applications acceleration may not help—in fact, it can actually make things worse!

I’ll make this statement more formal below, but let me first explain why I love this result. There is a tendency in algorithm design to optimize computational efficiency first, second and third and then maybe think about some other stuff as sort of an add-on constraint. We generally tend to equate faster with better. The reason why this is not a great methodology is that sometimes acceleration is mutually exclusive with other fundamental design goals. A theory that focuses primarily on speedups without discussing trade-offs with robustness misses a pretty important point.

When acceleration is good

Let’s build some intuition for the result I mentioned before we go into formal details. Consider my favorite example of minimizing a smooth convex function {f\colon \mathbb{R}^n\rightarrow\mathbb{R}} defined as

\displaystyle f(x) = \frac 12 x^T L x - b^T x

for some positive semidefinite {n\times n} matrix {L} and a vector {b\in\mathbb{R}^n.} Recall that the gradient is {\nabla f(x)=Lx-b.} An illustrative example is the Laplacian  of a cycle graph:

\displaystyle L = \left[ \begin{array}{ccccccc} 2 & -1 & 0 & 0 & 0 & \cdots & -1 \\ -1 & 2 & -1 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 2 & -1& 0 & \cdots & 0 \\ \vdots & & & & & & \end{array} \right]

Since {L} is positive semidefinite like any graph Laplacian, the function {f} is convex. The operator norm of {L} is bounded by~{4} and so we have that for all {x,y\in\mathbb{R}^n:}

\displaystyle \|\nabla f(x) -\nabla f(y)\| \le \|L\|\cdot\|x-y\|\le 4 \|x-y\|.

This means the function is also smooth and we can apply AGD/GD with a suitable step size. Comparing AGD and GD on this instance with {n=100}, we get the following picture:


It looks like AGD is the clear winner. GD is pretty slow and takes a few thousand iterations to decrease the error by an order of magnitude.

When acceleration is bad

The situation changes dramatically in the presence of noise. Let’s repeat the exact same experiment but now instead of observing {\nabla f(x)} for any given {x,} we can only see {\nabla f(x) + \xi} where {\xi} is sampled from the {n}-dimensional normal distribution {N(0,\sigma^2)^n.} Choosing {\sigma=0.1} we get the following picture:


Gradient descent pretty quickly converges to essentially the best result that we can hope for given the noisy gradients. In contrast, AGD goes totally nuts. It doesn’t converge at all and it adds up errors in sort of linear fashion. In this world, GD is the clear winner.

A precise trade-off

The first thing DGN do in their paper is to define a general notion of inexact first order oracle. Let’s recall what an exact first-order oracle does for an (unconstrained) convex function {f\colon\mathbb{R}^n\rightarrow \mathbb{R}} with Lipschitz constant {L.} Given any point {x\in\mathbb{R}^n} an exact first order oracle returns a pair {(f_L(x),g_L(x))\in\mathbb{R}\times\mathbb{R}^n} so that for all {y\in\mathbb{R}^n} we have

\displaystyle 0\le f(y) - \big(f_L(x) + \langle g_L(x),y-x\rangle\big)\le \frac L2\|y-x\|^2\,.

Pictorially, at every point {x} the function can be sandwiched between a tangent linear function specified by {(f_L(x),g_L(x))} and a parabola. The pair {(f(x),\nabla f(x))} satisfies this constraint as the first inequality follows from convexity and the second from the Lipschitz condition. In fact, this pair is the only pair that satisfies these conditions. My slightly cumbersome way of desribing a first-order oracle was only so that we may now easily generalize it to an inexact first-order oracle. Specifically, an inexact oracle returns for any given point {x\in\mathbb{R}^n} a pair {(f_{\delta,L}(x),g_{\delta,L}(x))\in\mathbb{R}\times\mathbb{R}^n} so that for all {y\in\mathbb{R}^n} we have

\displaystyle 0\le f(y) - \big(f_{\delta,L}(x) + \langle g_{\delta,L}(x),y-x\rangle\big)\le \frac L2\|y-x\|^2+\delta\,.

It’s the same picture as before except now there’s some {\delta} slack between the linear approximation and the parabola.
With this notion at hand, what DGN show is that given access to {\delta}-inexact first-order oracle Gradient Descent spits out a point {x^t} after {t} steps so that

\displaystyle f(x^t) - \min_x f(x) \le O\big(L/t\big) + \delta\,.

The big-oh notation is hiding the squared distance between the optimum and the starting point. Accelerated Gradient Descent on the other hand gives you

\displaystyle f(x^t) - \min_x f(x) \le O\big(L/t^2\big) + O\big(t \delta\big)\,.

Moreover, you cannot improve this-tradeoff between acceleration and error accumulation. That is any method that converges as {1/t^2} must accumulate errors as {t\delta.}

Open questions

  1. The lower bound I just mentioned in the previous paragraph stems from the fact that an inexact first-order oracle can embed non-smooth optimization problems for which a speedup is not possible. This is interesting, but it doesn’t resolve, for example, the question of whether there could be a speedup in the simple gaussian noise addition model that I mentioned above. This isn’t even a toy model—as you might object—since gaussian noise addition is what you would do to make gradient descent privacy preserving. See for example an upcoming FOCS paper by Bassily, Smith, Thakurta for an analysis of gradient descent with gaussian noise.
  2. Is there an analog of the DGN result in the eigenvalue world? More formally, can we show that any Krylov subspace method that converges asymptotically faster than the power method must accumulate errors?
  3. The cycle example above is often used to show that any blackbox gradient method requires at least {t\ge \Omega(1/\sqrt{\epsilon})} steps to converge to error {\epsilon} provided that {t} is less than the number of vertices of the cycle, that is the dimension {n} of the domain. (See, for example, Theorem 3.9. in Sebastien Bubeck’s book.) Are there any lower bounds that hold for {t\gg n}?


The code for these examples is available here.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

Pearson’s polynomial

A 120 year old algorithm for learning mixtures of gaussians turns out to be optimal

So, how was math writing in 1894? I’d imagined it to be a lot like one of Nietzsche’s last diary entries except with nouns replaced by German fraktur symbols, impenetrable formalisms and mind-bending passive voice constructions. I was in for no small surprise when I started reading Karl Pearson’s Contributions to the Mathematical Theory of Evolution. The paper is quite enjoyable! The math is rigorous yet not overly formal. Examples and philosophical explanations motivate various design choices and definitions. A careful experimental evaluation gives credibility to the theory.

The utterly mind-boggling aspect of the paper however is not how well-written it is, but rather what Pearson actually did in it and how far ahead of his time he was. This is the subject of this post.

Pearson was interested in building a mathematical theory for evolutionary biology. In hindsight, his work is also one of the earliest contributions to the computational theory of evolution. This already strikes me as visionary as it remains a hot research area today. The Simons Institute in Berkeley just devoted a semester-long program to it.

In his paper, he considers the distribution of a single measurement among a population, more concretely, the forehead width to body length ratio among a population of crabs. The crab data had been collected by zoologist Weldon and his wife during their 1892 easter vacation on Malta. Wikipedia describes the crab (carcinus maenas) as one of the “world’s worst alien invasive species”, but fails to acknowledge the crab’s academic contributions.

The Pearson-Weldon crab: Evil alien invader or misjudged scientific apprentice?

The Pearson-Weldon crab: Evil alien invader or misjudged scientific apprentice?

The Weldons had taken 1000 samples each with 23 attributes of wich all but the one attribute describing forehead to body length ratio seemed to follow a single normal distribution. Regarding this peculiar deviation from normal, Pearson notes:

In the case of certain biological, sociological, and economic measurements there is, however, a well-marked deviation from this normal shape, and it becomes important to determine the direction and amount of such deviation. The asymmetry may arise from the fact that the units grouped together in the measured material are not really homogeneous. It may happen that we have a mixture of {2, 3,\dots, n} homogeneous groups, each of which deviates about its own mean symmetrically and in a manner represented with sufficient accuracy by the normal curve.

What the paragraph describes is the problem of learning a mixture of gaussians: Each sample is drawn from one of several unknown gaussian distributions. Each distribution is selected with an unknown mixture probability. The goal is to find the parameters of each gaussian distribution as well as the mixing probabilities. Pearson focuses on the case of two gaussians which he believes is of special importance in the context of evolutionary biology:

A family probably breaks up first into two species, rather than three or more, owing to the pressure at a given time of some particular form of natural selection.

With this reasoning he stipulates that the crab measurements are drawn from a mixture of two gaussians and aims to recover the parameters of this mixture. Learning a mixture of two gaussians at the time was a formidable task that lead Pearson to come up with a powerful approach. His approach is based on the method of moments which uses the empirical moments of a distribution to distinguish between competing models. Given {n} samples {x_1,...,x_n} the {k}-th empirical moment is defined as {\frac1n\sum_ix_i^k}, which for sufficiently large {n} will approximate the true moment {\mathbb{E}\,x_i^k}. A mixture of two one-dimensional gaussians has {5} parameters so one might hope that {5} moments are sufficient to identify the parameters.

Pearson derived a ninth degree polynomial {p_5} in the first {5} moments and located the real roots of this polynomial over a carefully chosen interval. Each root gives a candidate mixture that matches the first {5} moments; there were two valid solutions, among which Pearson selected the one whose {6}-th moment was closest to the observed empirical {6}-th moment.

Pearson goes on to evaluate this method numerically on the crab samples—by hand. In doing so he touches on a number of issues such as numerical stability and number of floating point operations that seem to me as being well ahead of their time. Pearson was clearly aware of computational constraints and optimized his algorithm to be computationally tractable.

Learning mixtures of gaussians

Pearson’s seminal paper proposes a problem and a method, but it doesn’t give an analysis of the method. Does the method really work on all instances or did he get lucky on the crab population? Is there any efficient method that works on all instances? The sort of theorem we would like to have is: Given {n} samples from a mixture of two gaussians, we can estimate each parameter up to small additive distance in polynomial time.

Eric Price and I have a recent result that resolves this problem by giving a computationally efficient estimator that achieves an optimal bound on the number of samples:

Theorem. Denoting by {\sigma^2} the overall variance of the mixture, {\Theta(\sigma^{12})} samples are necessary and sufficient to estimate each parameter to constant additive distance. The estimator is computationally efficient and extends to the {d}-dimensional mixture problem up to a necessary factor of {O(\log d)} in the sample requirement.

Strikingly, our one-dimensional estimator turns out to be very closely related to Pearson’s approach. Some extensions are needed, but I’m inclined to say that our result can be interpreted as showing that Pearson’s original estimator is in fact an optimal solution to the problem he proposed.

Until a recent breakthrough result by Kalai, Moitra and Valiant (2010), no polynomial bounds for the problem were known except under stronger assumptions on the gaussian components.

 A word about definitions

Of course, we need to be a bit more careful with the above statements. Clearly we can only hope to recover the parameters up to permutation of the two gaussian components as any permutation gives an identical mixture. Second, we cannot hope to learn the mixture probabilities in general up to small error. If, for example, the two gaussian components are indistinguishable, there is no way of learning the mixture probabilities—even though we can estimate means and variances easily by treating the distribution as one gaussian. On the other hand, if the gaussian components are distinguishable then it is not hard to estimate the mixture probabilities from the other parameters and the overall mean and variance of the mixture. For this reason it makes some sense to treat the mixture probabilities separately and focus on learning means and variances of the mixture.

Another point to keep in mind is that our notion of parameter distance should be scale-free. A reasonable goal is therefore to learn parameters {\hat \mu_1,\hat \mu_2,\hat\sigma_1,\hat\sigma_2} such that if {\mu_1,\mu_2,\sigma_1,\sigma_2} are the unknown parameters, there is a permutation {\pi\colon\{1,2\}\rightarrow\{1,2\}} such that for all {i\in\{1,2\},}

\displaystyle \max\left( \big|\mu_i-\mu_{\pi(i)}\big|^2, \big|\sigma_i^2-\sigma_{\pi(i)}^2\big|\right)\le\epsilon \cdot \sigma^2.

Here, {\sigma^2} is the overall variance of the mixture. The definition extends to the {d}-dimensional problem for suitable notion of variance (essentially the maximum variance in each coordinate) and by replacing absolute values with {\ell_\infty}-norms.

Some intuition for the proof

While the main difficulty in the proof is the upper bound, let’s start with some intuition for why we can’t hope to do better than {\sigma^{12}}. Just like Pearson’s estimator and that of Kalai, Moitra and Valiant, our estimator is based on the first six moments of the distribution. It is not hard to show that estimating the {k}-th moment up to error {\epsilon\sigma^k} requires {\Omega(1/\epsilon^2)} samples. Hence, learning each of the first six moments to constant error needs {\Omega(\sigma^{12})} samples. This doesn’t directly imply a lower bound for the mixture problem. After all, learning mixtures could be strictly easier than estimating the first six moments. Nevertheless, we know from Pearson that there are two mixtures that have matching first {5} moments. This turns out to give you a huge hint as to how to prove a lower bound for the mixture problem. The key observation is that two mixtures that have matching first {5} moments and constant parameters are extremely close to each other after adding a gaussian variable {N(0,\sigma^2)} to both mixtures for large enought {\sigma^2}. Note that we still have two mixtures of gaussians after this operation and the total variance is {O(\sigma^2).}

Making this statement formal requires choosing the right probability metric. In our case this turns out to be the squared Hellinger distance. Sloppily identifying distributions with density functions like only a true computer scientist would do, the squared Hellinger distance is defined as

\displaystyle \mathrm{H}^2(f,g) = \frac12\int_{-\infty}^{\infty} \Big(\sqrt{f(x)}-\sqrt{g(x)}\Big)^2 \mathrm{d}x\,.

Lemma. Let {f,g} be mixtures with matching first {k} moments and constant parameters. Let {p,q} be the distributions obtained by adding {N(0,\sigma^2)} to each distribution. Then, {\mathrm{H}^2(p,q)\le O(1/\sigma^{2k+2})}.

The lemma immediately gives the lower bound we need. Indeed, the squared Hellinger distance is sub-additive on product distributions. So, it’s easy to see what happens for {n} independent samples from each distribution. This can increase the squared Hellinger distance by at most a factor {n}. In particular, for any {n=o(\sigma^{2k+2})} the squared Hellinger distance between {n} samples from each distribution is at most {o(1)}. Owing to a useful relation between the Hellinger distance and total variation, this implies that also the total variation distance is at most {o(1)}. Using the definition of total variation distance, this means no estimator (efficient or not) can tell apart the two distributions from {n=o(\sigma^{2k+2})} samples with constant success probability. In particular, parameter estimation is impossible.

Mixtures with identical first 5 moments before and after adding noise

This lemma is an example of a broader phenomenon: Matching moments plus “noise” implies tiny Hellinger distance. A great reference is Pollard’s book and lecture notes. By the way, Pollard’s expository writing is generally wonderful. For instance, check out his (unrelated) paper on guilt-free manipulation of conditional densities if you ever wondered about how to condition on a measure zero event without regret. Not that computer scientists are generally concerned about it.

Matching upper bound

The matching upper bound is a bit trickier. The proof uses a convenient notion of excess moments that was introduced by Pearson. You might have heard of excess kurtosis (the fourth excess moment) as it appears in the info box on every Wikipedia article about distributions. Excess moments have the appealing property that they are invariant under adding and subtracting a common term from the variance of each component.

Identical excess moments

The second excess moment is always {0.} As the picture suggests, we can make one component increasingly spiky and think of it as a distribution that places all its weight on a single point. In accordance with this notion of excess moments it makes sense to reparametrize the mixture in such a way that the parameters are also independent of adding and subtracting a common variance term. Assuming the overall mean of the mixture is {0,} this leaves us with three free parameters {\alpha,\beta,\gamma.} The third through sixth excess moment give us four equations in these three parameters.

Expressing the excess moments in terms of our new parameters {\alpha,\beta,\gamma,} we can derive in a fairly natural manner a ninth degree polynomial {p_5(y)} whose coefficients depend on the first {5} excess moments so that {\alpha} has to satisfy {p_5(\alpha)=0.} The polynomial {p_5} was already used by Pearson. Unfortunately, {p_5} can have multiple roots and this is to be expected since {5} moments are not sufficient to identify a mixture of two gaussians. Pearson computed the mixtures associated with each of the roots and threw out the invalid solutions (e.g. the ones that give imaginary variances), getting two valid mixtures that matched on the first five moments. He then chose the one whose sixth moment was closest to the observed sixth moment.

We proceed somewhat differently from Pearson after computing {p_5}. We derive another polynomial {p_6} (depending on all six moments) and a bound {B} such that {\alpha} is the only solution to the system of equations

\displaystyle \big\{p_5(y)=0,\quad p_6(y)=0,\quad 0<y\leq B\big\}.

This approach isn’t yet robust to small perturbations in the moments; for example, if {p_5} has a double root at {\alpha}, it may have no nearby root after perturbation. We therefore consider the polynomial {r(y) = p_5^2(y)+p_6^2(y)} which we know is zero at {\alpha.} We argue that {r(y)} is significantly nonzero for any {y} significantly far from {\alpha}. This is the main difficulty in the proof, but if you really read this far, it’s perhaps not a bad point to switch to reading our paper.


I sure drooled enough over how much I like Pearson’s paper. Let me conclude with a different practical thought. There were a bunch of tools (as in actual computer programs—gasp!) that made the work a lot easier. Before we proved the lower bound we had verified it using arbitrary floating point precision numerical integration to accuracy {10^{-300}} for {\sigma^2=10,100,1000,\dots} and the decrease in Hellinger distance was exactly what it should be. We used a Python package called mpmath for that.

Similarly, for the upper bound we used a symbolic computation package in Python called SymPy to factor various polynomials, perform substitutions and multiplications. It would’ve been tedious, time-consuming and error prone to do everything by hand (even if the final result is not too hard to check by hand).

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.