# Robustness versus Acceleration

My blog post on the Zen of Gradient Descent hit the front page of Hacker News the other day. I don’t know how that happened. It got me more views in one day than this most humble blog usually gets in half a year. I thought I should take this as an excuse to extend the post a bit by elaborating on one remark I made only in passing. You don’t need to go back to reading that post unless you want to. This one will be self contained.

The point I made is that basic Gradient Descent (GD) is noise tolerant in a way that Accelerated Gradient Descent (AGD) is not. That is to say, if we don’t have exact but rather approximate gradient information, GD might very well outperform AGD even though its convergence rate is worse in the exact setting. The truth is I was sort of bluffing. I didn’t actually have a proof of a formal statement that would nail down this point in a compelling way. It was more of a gut feeling based on some simple observations.

To break the suspense, I still haven’t proved the statement I vaguely thought was true back then, but fortunately somebody else had already done that. This is a thought provoking paper by Devolder, Glineur and Nesterov (DGN). Thanks to Cristobal Guzman for pointing me to this paper. Roughly, what they show is that any method that converges faster than the basic gradient descent method must accumulate errors linearly with the number of iterations. Hence, in various noisy settings auch as are common in applications acceleration may not help—in fact, it can actually make things worse!

I’ll make this statement more formal below, but let me first explain why I love this result. There is a tendency in algorithm design to optimize computational efficiency first, second and third and then maybe think about some other stuff as sort of an add-on constraint. We generally tend to equate faster with better. The reason why this is not a great methodology is that sometimes acceleration is mutually exclusive with other fundamental design goals. A theory that focuses primarily on speedups without discussing trade-offs with robustness misses a pretty important point.

## When acceleration is good

Let’s build some intuition for the result I mentioned before we go into formal details. Consider my favorite example of minimizing a smooth convex function ${f\colon \mathbb{R}^n\rightarrow\mathbb{R}}$ defined as

$\displaystyle f(x) = \frac 12 x^T L x - b^T x$

for some positive semidefinite ${n\times n}$ matrix ${L}$ and a vector ${b\in\mathbb{R}^n.}$ Recall that the gradient is ${\nabla f(x)=Lx-b.}$ An illustrative example is the Laplacian  of a cycle graph:

$\displaystyle L = \left[ \begin{array}{ccccccc} 2 & -1 & 0 & 0 & 0 & \cdots & -1 \\ -1 & 2 & -1 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 2 & -1& 0 & \cdots & 0 \\ \vdots & & & & & & \end{array} \right]$

Since ${L}$ is positive semidefinite like any graph Laplacian, the function ${f}$ is convex. The operator norm of ${L}$ is bounded by~${4}$ and so we have that for all ${x,y\in\mathbb{R}^n:}$

$\displaystyle \|\nabla f(x) -\nabla f(y)\| \le \|L\|\cdot\|x-y\|\le 4 \|x-y\|.$

This means the function is also smooth and we can apply AGD/GD with a suitable step size. Comparing AGD and GD on this instance with ${n=100}$, we get the following picture:

It looks like AGD is the clear winner. GD is pretty slow and takes a few thousand iterations to decrease the error by an order of magnitude.

The situation changes dramatically in the presence of noise. Let’s repeat the exact same experiment but now instead of observing ${\nabla f(x)}$ for any given ${x,}$ we can only see ${\nabla f(x) + \xi}$ where ${\xi}$ is sampled from the ${n}$-dimensional normal distribution ${N(0,\sigma^2)^n.}$ Choosing ${\sigma=0.1}$ we get the following picture:

Gradient descent pretty quickly converges to essentially the best result that we can hope for given the noisy gradients. In contrast, AGD goes totally nuts. It doesn’t converge at all and it adds up errors in sort of linear fashion. In this world, GD is the clear winner.

The first thing DGN do in their paper is to define a general notion of inexact first order oracle. Let’s recall what an exact first-order oracle does for an (unconstrained) convex function ${f\colon\mathbb{R}^n\rightarrow \mathbb{R}}$ with Lipschitz constant ${L.}$ Given any point ${x\in\mathbb{R}^n}$ an exact first order oracle returns a pair ${(f_L(x),g_L(x))\in\mathbb{R}\times\mathbb{R}^n}$ so that for all ${y\in\mathbb{R}^n}$ we have

$\displaystyle 0\le f(y) - \big(f_L(x) + \langle g_L(x),y-x\rangle\big)\le \frac L2\|y-x\|^2\,.$

Pictorially, at every point ${x}$ the function can be sandwiched between a tangent linear function specified by ${(f_L(x),g_L(x))}$ and a parabola. The pair ${(f(x),\nabla f(x))}$ satisfies this constraint as the first inequality follows from convexity and the second from the Lipschitz condition. In fact, this pair is the only pair that satisfies these conditions. My slightly cumbersome way of desribing a first-order oracle was only so that we may now easily generalize it to an inexact first-order oracle. Specifically, an inexact oracle returns for any given point ${x\in\mathbb{R}^n}$ a pair ${(f_{\delta,L}(x),g_{\delta,L}(x))\in\mathbb{R}\times\mathbb{R}^n}$ so that for all ${y\in\mathbb{R}^n}$ we have

$\displaystyle 0\le f(y) - \big(f_{\delta,L}(x) + \langle g_{\delta,L}(x),y-x\rangle\big)\le \frac L2\|y-x\|^2+\delta\,.$

It’s the same picture as before except now there’s some ${\delta}$ slack between the linear approximation and the parabola.
With this notion at hand, what DGN show is that given access to ${\delta}$-inexact first-order oracle Gradient Descent spits out a point ${x^t}$ after ${t}$ steps so that

$\displaystyle f(x^t) - \min_x f(x) \le O\big(L/t\big) + \delta\,.$

The big-oh notation is hiding the squared distance between the optimum and the starting point. Accelerated Gradient Descent on the other hand gives you

$\displaystyle f(x^t) - \min_x f(x) \le O\big(L/t^2\big) + O\big(t \delta\big)\,.$

Moreover, you cannot improve this-tradeoff between acceleration and error accumulation. That is any method that converges as ${1/t^2}$ must accumulate errors as ${t\delta.}$

## Open questions

1. The lower bound I just mentioned in the previous paragraph stems from the fact that an inexact first-order oracle can embed non-smooth optimization problems for which a speedup is not possible. This is interesting, but it doesn’t resolve, for example, the question of whether there could be a speedup in the simple gaussian noise addition model that I mentioned above. This isn’t even a toy model—as you might object—since gaussian noise addition is what you would do to make gradient descent privacy preserving. See for example an upcoming FOCS paper by Bassily, Smith, Thakurta for an analysis of gradient descent with gaussian noise.
2. Is there an analog of the DGN result in the eigenvalue world? More formally, can we show that any Krylov subspace method that converges asymptotically faster than the power method must accumulate errors?
3. The cycle example above is often used to show that any blackbox gradient method requires at least ${t\ge \Omega(1/\sqrt{\epsilon})}$ steps to converge to error ${\epsilon}$ provided that ${t}$ is less than the number of vertices of the cycle, that is the dimension ${n}$ of the domain. (See, for example, Theorem 3.9. in Sebastien Bubeck’s book.) Are there any lower bounds that hold for ${t\gg n}$?

## Pointers:

The code for these examples is available here.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

# Pearson’s polynomial

A 120 year old algorithm for learning mixtures of gaussians turns out to be optimal

So, how was math writing in 1894? I’d imagined it to be a lot like one of Nietzsche’s last diary entries except with nouns replaced by German fraktur symbols, impenetrable formalisms and mind-bending passive voice constructions. I was in for no small surprise when I started reading Karl Pearson’s Contributions to the Mathematical Theory of Evolution. The paper is quite enjoyable! The math is rigorous yet not overly formal. Examples and philosophical explanations motivate various design choices and definitions. A careful experimental evaluation gives credibility to the theory.

The utterly mind-boggling aspect of the paper however is not how well-written it is, but rather what Pearson actually did in it and how far ahead of his time he was. This is the subject of this post.

Pearson was interested in building a mathematical theory for evolutionary biology. In hindsight, his work is also one of the earliest contributions to the computational theory of evolution. This already strikes me as visionary as it remains a hot research area today. The Simons Institute in Berkeley just devoted a semester-long program to it.

In his paper, he considers the distribution of a single measurement among a population, more concretely, the forehead width to body length ratio among a population of crabs. The crab data had been collected by zoologist Weldon and his wife during their 1892 easter vacation on Malta. Wikipedia describes the crab (carcinus maenas) as one of the “world’s worst alien invasive species”, but fails to acknowledge the crab’s academic contributions.

The Pearson-Weldon crab: Evil alien invader or misjudged scientific apprentice?

The Weldons had taken 1000 samples each with 23 attributes of wich all but the one attribute describing forehead to body length ratio seemed to follow a single normal distribution. Regarding this peculiar deviation from normal, Pearson notes:

In the case of certain biological, sociological, and economic measurements there is, however, a well-marked deviation from this normal shape, and it becomes important to determine the direction and amount of such deviation. The asymmetry may arise from the fact that the units grouped together in the measured material are not really homogeneous. It may happen that we have a mixture of ${2, 3,\dots, n}$ homogeneous groups, each of which deviates about its own mean symmetrically and in a manner represented with sufficient accuracy by the normal curve.

What the paragraph describes is the problem of learning a mixture of gaussians: Each sample is drawn from one of several unknown gaussian distributions. Each distribution is selected with an unknown mixture probability. The goal is to find the parameters of each gaussian distribution as well as the mixing probabilities. Pearson focuses on the case of two gaussians which he believes is of special importance in the context of evolutionary biology:

A family probably breaks up first into two species, rather than three or more, owing to the pressure at a given time of some particular form of natural selection.

With this reasoning he stipulates that the crab measurements are drawn from a mixture of two gaussians and aims to recover the parameters of this mixture. Learning a mixture of two gaussians at the time was a formidable task that lead Pearson to come up with a powerful approach. His approach is based on the method of moments which uses the empirical moments of a distribution to distinguish between competing models. Given ${n}$ samples ${x_1,...,x_n}$ the ${k}$-th empirical moment is defined as ${\frac1n\sum_ix_i^k}$, which for sufficiently large ${n}$ will approximate the true moment ${\mathbb{E}\,x_i^k}$. A mixture of two one-dimensional gaussians has ${5}$ parameters so one might hope that ${5}$ moments are sufficient to identify the parameters.

Pearson derived a ninth degree polynomial ${p_5}$ in the first ${5}$ moments and located the real roots of this polynomial over a carefully chosen interval. Each root gives a candidate mixture that matches the first ${5}$ moments; there were two valid solutions, among which Pearson selected the one whose ${6}$-th moment was closest to the observed empirical ${6}$-th moment.

Pearson goes on to evaluate this method numerically on the crab samples—by hand. In doing so he touches on a number of issues such as numerical stability and number of floating point operations that seem to me as being well ahead of their time. Pearson was clearly aware of computational constraints and optimized his algorithm to be computationally tractable.

## Learning mixtures of gaussians

Pearson’s seminal paper proposes a problem and a method, but it doesn’t give an analysis of the method. Does the method really work on all instances or did he get lucky on the crab population? Is there any efficient method that works on all instances? The sort of theorem we would like to have is: Given ${n}$ samples from a mixture of two gaussians, we can estimate each parameter up to small additive distance in polynomial time.

Eric Price and I have a recent result that resolves this problem by giving a computationally efficient estimator that achieves an optimal bound on the number of samples:

Theorem. Denoting by ${\sigma^2}$ the overall variance of the mixture, ${\Theta(\sigma^{12})}$ samples are necessary and sufficient to estimate each parameter to constant additive distance. The estimator is computationally efficient and extends to the ${d}$-dimensional mixture problem up to a necessary factor of ${O(\log d)}$ in the sample requirement.

Strikingly, our one-dimensional estimator turns out to be very closely related to Pearson’s approach. Some extensions are needed, but I’m inclined to say that our result can be interpreted as showing that Pearson’s original estimator is in fact an optimal solution to the problem he proposed.

Until a recent breakthrough result by Kalai, Moitra and Valiant (2010), no polynomial bounds for the problem were known except under stronger assumptions on the gaussian components.

Of course, we need to be a bit more careful with the above statements. Clearly we can only hope to recover the parameters up to permutation of the two gaussian components as any permutation gives an identical mixture. Second, we cannot hope to learn the mixture probabilities in general up to small error. If, for example, the two gaussian components are indistinguishable, there is no way of learning the mixture probabilities—even though we can estimate means and variances easily by treating the distribution as one gaussian. On the other hand, if the gaussian components are distinguishable then it is not hard to estimate the mixture probabilities from the other parameters and the overall mean and variance of the mixture. For this reason it makes some sense to treat the mixture probabilities separately and focus on learning means and variances of the mixture.

Another point to keep in mind is that our notion of parameter distance should be scale-free. A reasonable goal is therefore to learn parameters ${\hat \mu_1,\hat \mu_2,\hat\sigma_1,\hat\sigma_2}$ such that if ${\mu_1,\mu_2,\sigma_1,\sigma_2}$ are the unknown parameters, there is a permutation ${\pi\colon\{1,2\}\rightarrow\{1,2\}}$ such that for all ${i\in\{1,2\},}$

$\displaystyle \max\left( \big|\mu_i-\mu_{\pi(i)}\big|^2, \big|\sigma_i^2-\sigma_{\pi(i)}^2\big|\right)\le\epsilon \cdot \sigma^2.$

Here, ${\sigma^2}$ is the overall variance of the mixture. The definition extends to the ${d}$-dimensional problem for suitable notion of variance (essentially the maximum variance in each coordinate) and by replacing absolute values with ${\ell_\infty}$-norms.

## Some intuition for the proof

While the main difficulty in the proof is the upper bound, let’s start with some intuition for why we can’t hope to do better than ${\sigma^{12}}$. Just like Pearson’s estimator and that of Kalai, Moitra and Valiant, our estimator is based on the first six moments of the distribution. It is not hard to show that estimating the ${k}$-th moment up to error ${\epsilon\sigma^k}$ requires ${\Omega(1/\epsilon^2)}$ samples. Hence, learning each of the first six moments to constant error needs ${\Omega(\sigma^{12})}$ samples. This doesn’t directly imply a lower bound for the mixture problem. After all, learning mixtures could be strictly easier than estimating the first six moments. Nevertheless, we know from Pearson that there are two mixtures that have matching first ${5}$ moments. This turns out to give you a huge hint as to how to prove a lower bound for the mixture problem. The key observation is that two mixtures that have matching first ${5}$ moments and constant parameters are extremely close to each other after adding a gaussian variable ${N(0,\sigma^2)}$ to both mixtures for large enought ${\sigma^2}$. Note that we still have two mixtures of gaussians after this operation and the total variance is ${O(\sigma^2).}$

Making this statement formal requires choosing the right probability metric. In our case this turns out to be the squared Hellinger distance. Sloppily identifying distributions with density functions like only a true computer scientist would do, the squared Hellinger distance is defined as

$\displaystyle \mathrm{H}^2(f,g) = \frac12\int_{-\infty}^{\infty} \Big(\sqrt{f(x)}-\sqrt{g(x)}\Big)^2 \mathrm{d}x\,.$

Lemma. Let ${f,g}$ be mixtures with matching first ${k}$ moments and constant parameters. Let ${p,q}$ be the distributions obtained by adding ${N(0,\sigma^2)}$ to each distribution. Then, ${\mathrm{H}^2(p,q)\le O(1/\sigma^{2k+2})}$.

The lemma immediately gives the lower bound we need. Indeed, the squared Hellinger distance is sub-additive on product distributions. So, it’s easy to see what happens for ${n}$ independent samples from each distribution. This can increase the squared Hellinger distance by at most a factor ${n}$. In particular, for any ${n=o(\sigma^{2k+2})}$ the squared Hellinger distance between ${n}$ samples from each distribution is at most ${o(1)}$. Owing to a useful relation between the Hellinger distance and total variation, this implies that also the total variation distance is at most ${o(1)}$. Using the definition of total variation distance, this means no estimator (efficient or not) can tell apart the two distributions from ${n=o(\sigma^{2k+2})}$ samples with constant success probability. In particular, parameter estimation is impossible.

This lemma is an example of a broader phenomenon: Matching moments plus “noise” implies tiny Hellinger distance. A great reference is Pollard’s book and lecture notes. By the way, Pollard’s expository writing is generally wonderful. For instance, check out his (unrelated) paper on guilt-free manipulation of conditional densities if you ever wondered about how to condition on a measure zero event without regret. Not that computer scientists are generally concerned about it.

### Matching upper bound

The matching upper bound is a bit trickier. The proof uses a convenient notion of excess moments that was introduced by Pearson. You might have heard of excess kurtosis (the fourth excess moment) as it appears in the info box on every Wikipedia article about distributions. Excess moments have the appealing property that they are invariant under adding and subtracting a common term from the variance of each component.

The second excess moment is always ${0.}$ As the picture suggests, we can make one component increasingly spiky and think of it as a distribution that places all its weight on a single point. In accordance with this notion of excess moments it makes sense to reparametrize the mixture in such a way that the parameters are also independent of adding and subtracting a common variance term. Assuming the overall mean of the mixture is ${0,}$ this leaves us with three free parameters ${\alpha,\beta,\gamma.}$ The third through sixth excess moment give us four equations in these three parameters.

Expressing the excess moments in terms of our new parameters ${\alpha,\beta,\gamma,}$ we can derive in a fairly natural manner a ninth degree polynomial ${p_5(y)}$ whose coefficients depend on the first ${5}$ excess moments so that ${\alpha}$ has to satisfy ${p_5(\alpha)=0.}$ The polynomial ${p_5}$ was already used by Pearson. Unfortunately, ${p_5}$ can have multiple roots and this is to be expected since ${5}$ moments are not sufficient to identify a mixture of two gaussians. Pearson computed the mixtures associated with each of the roots and threw out the invalid solutions (e.g. the ones that give imaginary variances), getting two valid mixtures that matched on the first five moments. He then chose the one whose sixth moment was closest to the observed sixth moment.

We proceed somewhat differently from Pearson after computing ${p_5}$. We derive another polynomial ${p_6}$ (depending on all six moments) and a bound ${B}$ such that ${\alpha}$ is the only solution to the system of equations

$\displaystyle \big\{p_5(y)=0,\quad p_6(y)=0,\quad 0

This approach isn’t yet robust to small perturbations in the moments; for example, if ${p_5}$ has a double root at ${\alpha}$, it may have no nearby root after perturbation. We therefore consider the polynomial ${r(y) = p_5^2(y)+p_6^2(y)}$ which we know is zero at ${\alpha.}$ We argue that ${r(y)}$ is significantly nonzero for any ${y}$ significantly far from ${\alpha}$. This is the main difficulty in the proof, but if you really read this far, it’s perhaps not a bad point to switch to reading our paper.

## Conclusions

I sure drooled enough over how much I like Pearson’s paper. Let me conclude with a different practical thought. There were a bunch of tools (as in actual computer programs—gasp!) that made the work a lot easier. Before we proved the lower bound we had verified it using arbitrary floating point precision numerical integration to accuracy ${10^{-300}}$ for ${\sigma^2=10,100,1000,\dots}$ and the decrease in Hellinger distance was exactly what it should be. We used a Python package called mpmath for that.

Similarly, for the upper bound we used a symbolic computation package in Python called SymPy to factor various polynomials, perform substitutions and multiplications. It would’ve been tedious, time-consuming and error prone to do everything by hand (even if the final result is not too hard to check by hand).

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

# False Discovery and Differential Privacy

The Simons program on Big Data wrapped up about a month ago with a workshop on Big Data and Differential Privacy. With the possible exception of one talk on the last day after lunch, it was a really fabulous workshop. You would’ve enjoyed attending even if you were neither interested in Big Data nor Differential Privacy. One reason is that the organizers aimed to address a problem that transcends both machine learning and privacy, a problem relevant to all empirical sciences. The problem is false discovery.

If you’ve never seen it, you should check out this XKCD comic explaining why green jelly beans cause acne. The idea is that if you evaluate enough hypotheses on the same data set, eventually you will find something significant. The availability of public data sets used by thousands of scientists greatly exacerbates the problem. An important example is the area of Genome Wide Association Studies (GWAS).

A typical study has a few hundred or perhaps a few thousands patient DNA sequences. Each sequence contains a few hundred thousands so-called SNPs. Scientists check each SNP for possible correlation with a particular phenotype. Significance tests and p-values are part of the standard method. But what p-value is reasonable? In GWAS typical p-values are as small as ${10^{-6}}$ or even ${10^{-7}.}$ They get smaller ever year as the SNP resolution of the sequenced genome increases. How many SNPs are incorrectly declared significant? How do we interpret these scientific findings?

A similar discussion arose recently in the FDA induced shutdown of 23andMe. Lior Pachter started a heated discussion on his blog about why the findings of of 23andMe are unreliable. In short: Too many hypotheses evaluated against a single genome. Meanwhile, Scott Aaronson asserts his right to misinterpret probabilities.

GWAS is certainly not the only example. Some think a major scientific crisis is upon us.

So, what can we do about it? A person who has spent at least two decades thinking about this problem is Yoav Benjamini. Luckily, he agreed to give a tutorial about his work at the privacy workshop revolving around the idea of False Discovery Rate.

# False Discovery Rate

The basic setup is this. There are ${m}$ null hypotheses ${h_1,\dots,h_m}$ that we would like to test. A discovery corresponds to a rejected null hypothesis. Let ${V}$ be the random variable counting the number of false discoveries, i.e., rejected null hypotheses that are actually true. We also let ${R}$ denote the total number of rejected hypothesis, i.e., all discoveries both true and false. The false discovery rate (FDR) is the expectation of the ratio ${V/R,}$ where we define the ratio as ${0}$ if ${R=0.}$

The idea behind this definition is that if there are many discoveries, it might be tolerable to make some false discoveries. However, if all null hypotheses are true, all discoveries are false. So, even a single discovery brings the FDR up to ${1}$ (the largest possible value).

I encourage you to check out Omer Reingold’s Musings on False Discovery Rate from a Computer Scientist’s point of view.

Now, suppose we want to design an experiment while controlling the FDR. That is we want to make sure that the FDR is at most some ${\alpha< 1.}$ Benjamini and Hochberg suggested an algorithm for doing exactly that. With more than 20000 citations this might be one of the most widely cited algorithms:

Let ${p_1,\dots,p_m}$ be ${p}$-values corresponding to our ${m}$ hypotheses. Sort these ${p}$-values in increasing order ${p_{(1)}\le \dots\le p_{(m)}.}$

1. Find the largest ${k}$ such that ${p_{(k)} \le \frac{k}{m}\cdot \alpha}$.

2. Reject ${h_{(1)},\dots,h_{(k)}.}$

An important theorem they proved is that indeed this procedure controls the false discovery rate under certain assumptions. For example, if our test statistics are independent of each other, this holds. Of course, independence is a very strong assumption and a lot of work in statistics aims to weaken these assumptions.

My general feeling about this definition is that it is optimistic in several regards. The first is that we only talk about expectations. It is easy to come up with examples where the expectation is small but the with small but non-negligible probability there are many false discoveries. Second, we need several fairly strong assumptions to be able to prove that such a procedure actually controls the false discovery rate. Finally, the method puts lot of faith in careful experimental design and proper execution. In particular to apply the methodology we need knowledge of what hypotheses we might test.

Nevertheless, this optimism in the definition leads to usable and realistic answers as to how large the sample should be in concrete experimental setups. This must have been a major factor in the success of this methodology.

# Differential Privacy

How is any of this related to privacy? The short answer is that Differential Privacy by itself controls false discovery in a certain precise sense—though formally different from the above. Indeed, I will formally show below that Differential Privacy implies small “generalization error”.

This is by no means an accident of the definition. There is a deeper reason for it. Intuitively, Differential Privacy ensures that the outcome of a statistical analysis does not depend on the specifics of the data set but rather the underlying properties of the population. This is why it gives privacy. I may be able to learn from the data that smoking causes cancer, but I wouldn’t be able to learn that a particular individual in the database, say Donald Draper, smokes and has cancer. The flip side is that Differential Privacy does not allow you to access the data set arbitrarily often. Instead it attaches a cost to each operation and an overall budget that you better not exceed. The most common complain about Differential Privacy is that it doesn’t allow you to do whatever the heck you want with the data. My response is usually that neither does statistics. A data set has limited explanatory power. Use it carelessly and you’ll end up with false discovery. I think it’s a feature of Differential Privacy—not a shortcoming—that it makes this fundamental fact of nature explicit.

If this philosophical excursion struck you as unconvincing, let me try the formal route. This argument is folklore by the way. I learned it from Guy Rothblum a few years ago who attributed it to Frank McSherry at the time. Let me know if I’m missing the right attribution.

Suppose we want to learn a concept ${c\colon\{0,1\}^d \rightarrow \{0,1\}}$ from labeled examples ${D=\{(x_1,l_1),\dots,(x_m,l_m)\}}$ drawn from some underlying distribution ${P}$.

Now, suppose we use an ${\epsilon}$-differentially private learning algorithm ${{\cal A}}$ for the task. That is ${{\cal A}}$ is a randomized algorithm such that changing one example has little effect on the output distribution of the algorithm. Formally, for any set of examples ${D'}$ that differs from ${D}$ in only one pair, we have for any set of output hypotheses ${H}$:

$\displaystyle \Pr\{ A(D) \in H \} \ge e^{-\epsilon}\Pr\{ A(D') \in H \}.$

Furthermore, assume that ${{\cal A}}$ is ${\alpha}$-useful in the sense that it always outputs a hypothesis ${h}$ that has error at most ${\alpha}$ on the training examples. Here, ${\epsilon,\alpha \ll 1.}$ (By the way, I’m using the word “hypothesis” here in its learning theory sense not the “null hypothesis” sense.)

I claim that with the output of ${{\cal A}}$ must have error at most ${O(\epsilon + \alpha)}$ on the underlying distribution ${P.}$ This means precisely that the output generalizes. Importantly, we didn’t assume anything about the hypothesis class! We only assume that the learner is differentially private.

Proof sketch: Pick a fresh example ${(x^*,l^*)}$ from the distribution ${P.}$ Let ${H^*= \{ h \colon h(x^*) = c(x^*) \}}$ be the set of hypotheses that agree with the concept ${c}$ on this example ${x^*.}$ Let ${D'}$ be the example set where we remove a random element from ${D}$ and replace it with ${(x^*,l^*).}$ Since, ${{\cal A}}$ is useful, it must be the case that ${h' = {\cal A}(D')}$ has error at most ${\alpha}$ on ${D'}$. In particular, it classifies ${x^*}$ correctly with probability ${1-\alpha.}$ This is because all examples in ${D'}$ are identically distributed. Formally, ${\Pr\{ {\cal A}(D') \in H^* \}\ge 1-\alpha.}$ Note the randomness is now over both ${{\cal A}}$ and the replacement we did. Appealing to the definition of Differential Privacy above, it follows that

$\displaystyle \Pr\{ A(D) \in H^*\} \ge e^{-\epsilon}(1-\alpha) \ge 1 - O(\epsilon + \alpha).$

That means ${A(D)}$ is correct on an *unseen* example from ${D}$ with high probability.

The above is just a simple example of a broader phenomenon showing that stability of a learning algorithm implies generalization bounds. Differential Privacy is a strong form of stability that implies most notions of stability studied in learning theory.

# What’s next?

False discovery and privacy loss are two persistent issues we’re facing in the era of data analysis. The two problems share some fundamental characteristics. It could be very fruitful think about both problems in relation to each other, or, even as facets of the same underlying problem. Technically, there seems to be some unexplored middle ground between FDR and Differential Privacy that I’d like to understand better.

The discussion here is also closely related to my first post of the semester in which I asked: What should a theory of big data do? I argued that besides posing concrete technical challenges, big data poses a deep conceptual problem. We seem to have difficulty capturing the way people interact with data today. This makes it hard for us to get control over the things that can go wrong. I seem to have come a full circle asking the same question again. Of course, I knew that this was too broad a question to be resolved in the span of four months.