Pearson’s polynomial

A 120-year-old algorithm for learning mixtures of gaussians turns out to be optimal

So, how was math writing in 1894? I’d imagined it to be a lot like one of Nietzsche’s last diary entries except with nouns replaced by German fraktur symbols, impenetrable formalisms and mind-bending passive voice constructions. I was in for no small surprise when I started reading Karl Pearson’s Contributions to the Mathematical Theory of Evolution. The paper is quite enjoyable! The math is rigorous yet not overly formal. Examples and philosophical explanations motivate various design choices and definitions. A careful experimental evaluation gives credibility to the theory.

The utterly mind-boggling aspect of the paper however is not how well-written it is, but rather what Pearson actually did in it and how far ahead of his time he was. This is the subject of this post.

Pearson was interested in building a mathematical theory for evolutionary biology. In hindsight, his work is also one of the earliest contributions to the computational theory of evolution. This already strikes me as visionary as it remains a hot research area today. The Simons Institute in Berkeley just devoted a semester-long program to it.

In his paper, he considers the distribution of a single measurement among a population, more concretely, the forehead width to body length ratio among a population of crabs. The crab data had been collected by the zoologist Weldon and his wife during their 1892 Easter vacation on Malta. Wikipedia describes the crab (Carcinus maenas) as one of the “world’s worst alien invasive species”, but fails to acknowledge the crab’s academic contributions.

The Pearson-Weldon crab: Evil alien invader or misjudged scientific apprentice?

The Weldons had taken 1000 samples, each with 23 attributes, of which all but the one attribute describing forehead to body length ratio seemed to follow a single normal distribution. Regarding this peculiar deviation from normal, Pearson notes:

In the case of certain biological, sociological, and economic measurements there is, however, a well-marked deviation from this normal shape, and it becomes important to determine the direction and amount of such deviation. The asymmetry may arise from the fact that the units grouped together in the measured material are not really homogeneous. It may happen that we have a mixture of {2, 3,\dots, n} homogeneous groups, each of which deviates about its own mean symmetrically and in a manner represented with sufficient accuracy by the normal curve.

What the paragraph describes is the problem of learning a mixture of gaussians: Each sample is drawn from one of several unknown gaussian distributions. Each distribution is selected with an unknown mixture probability. The goal is to find the parameters of each gaussian distribution as well as the mixing probabilities. Pearson focuses on the case of two gaussians which he believes is of special importance in the context of evolutionary biology:

A family probably breaks up first into two species, rather than three or more, owing to the pressure at a given time of some particular form of natural selection.

With this reasoning he stipulates that the crab measurements are drawn from a mixture of two gaussians and aims to recover the parameters of this mixture. Learning a mixture of two gaussians at the time was a formidable task that led Pearson to come up with a powerful approach. His approach is based on the method of moments, which uses the empirical moments of a distribution to distinguish between competing models. Given {n} samples {x_1,\dots,x_n}, the {k}-th empirical moment is defined as {\frac1n\sum_ix_i^k}, which for sufficiently large {n} will approximate the true moment {\mathbb{E}\,x_i^k}. A mixture of two one-dimensional gaussians has {5} parameters, so one might hope that {5} moments are sufficient to identify the parameters.
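
To make the setup concrete, here is a minimal numpy sketch (mine, not Pearson's) of the empirical moments of samples drawn from a two-component mixture; the means, variances and mixing weight below are arbitrary illustrative choices.

```python
import numpy as np

def empirical_moments(samples, k=6):
    """First k empirical raw moments: (1/n) * sum_i x_i^j for j = 1, ..., k."""
    x = np.asarray(samples, dtype=float)
    return [float(np.mean(x**j)) for j in range(1, k + 1)]

# Toy data: a mixture of two gaussians with (arbitrary) mixing weight 0.4.
rng = np.random.default_rng(0)
n = 100_000
pick_first = rng.random(n) < 0.4
samples = np.where(pick_first,
                   rng.normal(-1.0, 0.5, size=n),
                   rng.normal(1.5, 1.0, size=n))
print(empirical_moments(samples))
```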

Pearson derived a ninth degree polynomial {p_5} in the first {5} moments and located the real roots of this polynomial over a carefully chosen interval. Each root gives a candidate mixture that matches the first {5} moments; there were two valid solutions, among which Pearson selected the one whose {6}-th moment was closest to the observed empirical {6}-th moment.

Pearson goes on to evaluate this method numerically on the crab samples—by hand. In doing so he touches on a number of issues such as numerical stability and number of floating point operations that seem to me as being well ahead of their time. Pearson was clearly aware of computational constraints and optimized his algorithm to be computationally tractable.

Learning mixtures of gaussians

Pearson’s seminal paper proposes a problem and a method, but it doesn’t give an analysis of the method. Does the method really work on all instances or did he get lucky on the crab population? Is there any efficient method that works on all instances? The sort of theorem we would like to have is: Given {n} samples from a mixture of two gaussians, we can estimate each parameter up to small additive distance in polynomial time.

Eric Price and I have a recent result that resolves this problem by giving a computationally efficient estimator that achieves an optimal bound on the number of samples:

Theorem. Denoting by {\sigma^2} the overall variance of the mixture, {\Theta(\sigma^{12})} samples are necessary and sufficient to estimate each parameter to constant additive distance. The estimator is computationally efficient and extends to the {d}-dimensional mixture problem up to a necessary factor of {O(\log d)} in the sample requirement.

Strikingly, our one-dimensional estimator turns out to be very closely related to Pearson’s approach. Some extensions are needed, but I’m inclined to say that our result can be interpreted as showing that Pearson’s original estimator is in fact an optimal solution to the problem he proposed.

Until a recent breakthrough result by Kalai, Moitra and Valiant (2010), no polynomial bounds for the problem were known except under stronger assumptions on the gaussian components.

A word about definitions

Of course, we need to be a bit more careful with the above statements. Clearly we can only hope to recover the parameters up to permutation of the two gaussian components as any permutation gives an identical mixture. Second, we cannot hope to learn the mixture probabilities in general up to small error. If, for example, the two gaussian components are indistinguishable, there is no way of learning the mixture probabilities—even though we can estimate means and variances easily by treating the distribution as one gaussian. On the other hand, if the gaussian components are distinguishable then it is not hard to estimate the mixture probabilities from the other parameters and the overall mean and variance of the mixture. For this reason it makes some sense to treat the mixture probabilities separately and focus on learning means and variances of the mixture.

Another point to keep in mind is that our notion of parameter distance should be scale-free. A reasonable goal is therefore to learn parameters {\hat \mu_1,\hat \mu_2,\hat\sigma_1,\hat\sigma_2} such that if {\mu_1,\mu_2,\sigma_1,\sigma_2} are the unknown parameters, there is a permutation {\pi\colon\{1,2\}\rightarrow\{1,2\}} such that for all {i\in\{1,2\},}

\displaystyle \max\left( \big|\hat\mu_i-\mu_{\pi(i)}\big|^2, \big|\hat\sigma_i^2-\sigma_{\pi(i)}^2\big|\right)\le\epsilon \cdot \sigma^2.

Here, {\sigma^2} is the overall variance of the mixture. The definition extends to the {d}-dimensional problem for suitable notion of variance (essentially the maximum variance in each coordinate) and by replacing absolute values with {\ell_\infty}-norms.
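
As a small illustration, here is a hypothetical helper (my own naming and argument conventions, with variances passed directly) that checks this accuracy criterion for a candidate estimate:

```python
import itertools
import numpy as np

def within_distance(est, true, sigma_sq, eps):
    """Check the criterion above: for some permutation pi and all i in {1, 2},
    max(|mu_i - mu_pi(i)|^2, |sigma_i^2 - sigma_pi(i)^2|) <= eps * sigma^2.
    `est` and `true` are tuples (mu_1, mu_2, var_1, var_2); `sigma_sq` is the
    overall variance of the mixture."""
    mu_e, var_e = np.array(est[:2]), np.array(est[2:])
    mu_t, var_t = np.array(true[:2]), np.array(true[2:])
    for perm in itertools.permutations((0, 1)):
        if all(max((mu_e[i] - mu_t[perm[i]])**2,
                   abs(var_e[i] - var_t[perm[i]])) <= eps * sigma_sq
               for i in range(2)):
            return True
    return False

# Example: the estimate swaps the two components, which is fine.
print(within_distance(est=(1.0, 0.0, 2.0, 1.0), true=(0.0, 1.0, 1.0, 2.0),
                      sigma_sq=1.5, eps=0.01))
```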

Some intuition for the proof

While the main difficulty in the proof is the upper bound, let’s start with some intuition for why we can’t hope to do better than {\sigma^{12}}. Just like Pearson’s estimator and that of Kalai, Moitra and Valiant, our estimator is based on the first six moments of the distribution. It is not hard to show that estimating the {k}-th moment up to error {\epsilon\sigma^k} requires {\Omega(1/\epsilon^2)} samples. Hence, learning each of the first six moments to constant error needs {\Omega(\sigma^{12})} samples. This doesn’t directly imply a lower bound for the mixture problem. After all, learning mixtures could be strictly easier than estimating the first six moments. Nevertheless, we know from Pearson that there are two mixtures that have matching first {5} moments. This turns out to give you a huge hint as to how to prove a lower bound for the mixture problem. The key observation is that two mixtures that have matching first {5} moments and constant parameters are extremely close to each other after adding a gaussian variable {N(0,\sigma^2)} to both mixtures for large enough {\sigma^2}. Note that we still have two mixtures of gaussians after this operation and the total variance is {O(\sigma^2).}

Making this statement formal requires choosing the right probability metric. In our case this turns out to be the squared Hellinger distance. Sloppily identifying distributions with density functions like only a true computer scientist would do, the squared Hellinger distance is defined as

\displaystyle \mathrm{H}^2(f,g) = \frac12\int_{-\infty}^{\infty} \Big(\sqrt{f(x)}-\sqrt{g(x)}\Big)^2 \mathrm{d}x\,.

Lemma. Let {f,g} be mixtures with matching first {k} moments and constant parameters. Let {p,q} be the distributions obtained by adding {N(0,\sigma^2)} to each distribution. Then, {\mathrm{H}^2(p,q)\le O(1/\sigma^{2k+2})}.

The lemma immediately gives the lower bound we need. Indeed, the squared Hellinger distance is sub-additive on product distributions. So, it’s easy to see what happens for {n} independent samples from each distribution. This can increase the squared Hellinger distance by at most a factor {n}. In particular, for any {n=o(\sigma^{2k+2})} the squared Hellinger distance between {n} samples from each distribution is at most {o(1)}. Owing to a useful relation between the Hellinger distance and total variation, this implies that also the total variation distance is at most {o(1)}. Using the definition of total variation distance, this means no estimator (efficient or not) can tell apart the two distributions from {n=o(\sigma^{2k+2})} samples with constant success probability. In particular, parameter estimation is impossible.

Mixtures with identical first 5 moments before and after adding noise

This lemma is an example of a broader phenomenon: Matching moments plus “noise” implies tiny Hellinger distance. A great reference is Pollard’s book and lecture notes. By the way, Pollard’s expository writing is generally wonderful. For instance, check out his (unrelated) paper on guilt-free manipulation of conditional densities if you ever wondered about how to condition on a measure zero event without regret. Not that computer scientists are generally concerned about it.

Matching upper bound

The matching upper bound is a bit trickier. The proof uses a convenient notion of excess moments that was introduced by Pearson. You might have heard of excess kurtosis (the fourth excess moment) as it appears in the info box on every Wikipedia article about distributions. Excess moments have the appealing property that they are invariant under adding and subtracting a common term from the variance of each component.

Identical excess moments

The second excess moment is always {0.} As the picture suggests, we can make one component increasingly spiky and think of it as a distribution that places all its weight on a single point. In accordance with this notion of excess moments it makes sense to reparametrize the mixture in such a way that the parameters are also independent of adding and subtracting a common variance term. Assuming the overall mean of the mixture is {0,} this leaves us with three free parameters {\alpha,\beta,\gamma.} The third through sixth excess moment give us four equations in these three parameters.

Expressing the excess moments in terms of our new parameters {\alpha,\beta,\gamma,} we can derive in a fairly natural manner a ninth degree polynomial {p_5(y)} whose coefficients depend on the first {5} excess moments so that {\alpha} has to satisfy {p_5(\alpha)=0.} The polynomial {p_5} was already used by Pearson. Unfortunately, {p_5} can have multiple roots and this is to be expected since {5} moments are not sufficient to identify a mixture of two gaussians. Pearson computed the mixtures associated with each of the roots and threw out the invalid solutions (e.g. the ones that give imaginary variances), getting two valid mixtures that matched on the first five moments. He then chose the one whose sixth moment was closest to the observed sixth moment.

We proceed somewhat differently from Pearson after computing {p_5}. We derive another polynomial {p_6} (depending on all six moments) and a bound {B} such that {\alpha} is the only solution to the system of equations

\displaystyle \big\{p_5(y)=0,\quad p_6(y)=0,\quad 0<y\leq B\big\}.

This approach isn’t yet robust to small perturbations in the moments; for example, if {p_5} has a double root at {\alpha}, it may have no nearby root after perturbation. We therefore consider the polynomial {r(y) = p_5^2(y)+p_6^2(y)} which we know is zero at {\alpha.} We argue that {r(y)} is significantly nonzero for any {y} significantly far from {\alpha}. This is the main difficulty in the proof, but if you really read this far, it’s perhaps not a bad point to switch to reading our paper.

Conclusions

I sure drooled enough over how much I like Pearson’s paper. Let me conclude with a different practical thought. There were a bunch of tools (as in actual computer programs—gasp!) that made the work a lot easier. Before we proved the lower bound, we had verified it using arbitrary-precision numerical integration to accuracy {10^{-300}} for {\sigma^2=10,100,1000,\dots} and the decrease in Hellinger distance was exactly what it should be. We used a Python package called mpmath for that.
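
Here is a rough sketch of that kind of check using mpmath. It is not our actual script, and the two mixtures below are placeholders rather than a genuine moment-matching pair, but it shows how the convolution with {N(0,\sigma^2)} and the Hellinger integral can be evaluated at high precision.

```python
from mpmath import mp, mpf, sqrt, exp, pi, quad

mp.dps = 60  # decimal digits of working precision (the actual check used far more)

def mixture_pdf(w, mu1, var1, mu2, var2, added_var=0):
    """Density of a two-component gaussian mixture; convolving with N(0, s^2)
    simply adds s^2 to each component variance."""
    def pdf(x):
        v1, v2 = var1 + added_var, var2 + added_var
        g1 = exp(-(x - mu1)**2 / (2 * v1)) / sqrt(2 * pi * v1)
        g2 = exp(-(x - mu2)**2 / (2 * v2)) / sqrt(2 * pi * v2)
        return w * g1 + (1 - w) * g2
    return pdf

def hellinger_sq(f, g, lo=-500, hi=500):
    # The integrand decays fast, so a wide finite interval is enough here.
    return quad(lambda x: (sqrt(f(x)) - sqrt(g(x)))**2, [lo, hi]) / 2

# Placeholder mixtures -- for the real experiment one plugs in a pair with
# matching first five moments and watches H^2 decay as sigma^2 grows.
f0 = (mpf('0.5'), mpf(-1), mpf(1), mpf(1), mpf(1))
g0 = (mpf('0.4'), mpf(-1), mpf(1), mpf('1.25'), mpf(1))
for s2 in [10, 100, 1000]:
    print(s2, hellinger_sq(mixture_pdf(*f0, added_var=s2),
                           mixture_pdf(*g0, added_var=s2)))
```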

Similarly, for the upper bound we used a symbolic computation package in Python called SymPy to factor various polynomials, perform substitutions and multiplications. It would’ve been tedious, time-consuming and error prone to do everything by hand (even if the final result is not too hard to check by hand).
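
For a flavor of the symbolic bookkeeping involved, here is a small SymPy sketch (not the actual derivation of {p_5} or {p_6}) that generates the raw moments of a two-component mixture from the moment generating function; everything downstream is substitutions and eliminations of expressions of this sort.

```python
import sympy as sp

w, mu1, mu2, v1, v2, t = sp.symbols('w mu1 mu2 v1 v2 t', real=True)

def gaussian_raw_moment(mu, var, k):
    """k-th raw moment of N(mu, var), read off the moment generating function."""
    mgf = sp.exp(mu * t + var * t**2 / 2)
    return sp.diff(mgf, t, k).subs(t, 0)

def mixture_moment(k):
    """k-th raw moment of the mixture w * N(mu1, v1) + (1 - w) * N(mu2, v2)."""
    return sp.expand(w * gaussian_raw_moment(mu1, v1, k)
                     + (1 - w) * gaussian_raw_moment(mu2, v2, k))

for k in range(1, 7):
    print(k, mixture_moment(k))
```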


False Discovery and Differential Privacy

The Simons program on Big Data wrapped up about a month ago with a workshop on Big Data and Differential Privacy. With the possible exception of one talk on the last day after lunch, it was a really fabulous workshop. You would’ve enjoyed attending even if you were neither interested in Big Data nor Differential Privacy. One reason is that the organizers aimed to address a problem that transcends both machine learning and privacy, a problem relevant to all empirical sciences. The problem is false discovery.

If you’ve never seen it, you should check out this XKCD comic explaining why green jelly beans cause acne. The idea is that if you evaluate enough hypotheses on the same data set, eventually you will find something significant. The availability of public data sets used by thousands of scientists greatly exacerbates the problem. An important example is the area of Genome Wide Association Studies (GWAS).

A typical study has a few hundred or perhaps a few thousand patient DNA sequences. Each sequence contains a few hundred thousand so-called SNPs. Scientists check each SNP for possible correlation with a particular phenotype. Significance tests and p-values are part of the standard method. But what p-value is reasonable? In GWAS typical p-values are as small as {10^{-6}} or even {10^{-7}.} They get smaller every year as the SNP resolution of the sequenced genome increases. How many SNPs are incorrectly declared significant? How do we interpret these scientific findings?

A similar discussion arose recently in the FDA-induced shutdown of 23andMe. Lior Pachter started a heated discussion on his blog about why the findings of 23andMe are unreliable. In short: too many hypotheses evaluated against a single genome. Meanwhile, Scott Aaronson asserts his right to misinterpret probabilities.

GWAS is certainly not the only example. Some think a major scientific crisis is upon us.

So, what can we do about it? A person who has spent at least two decades thinking about this problem is Yoav Benjamini. Luckily, he agreed to give a tutorial about his work at the privacy workshop revolving around the idea of False Discovery Rate.

False Discovery Rate

The basic setup is this. There are {m} null hypotheses {h_1,\dots,h_m} that we would like to test. A discovery corresponds to a rejected null hypothesis. Let {V} be the random variable counting the number of false discoveries, i.e., rejected null hypotheses that are actually true. We also let {R} denote the total number of rejected hypotheses, i.e., all discoveries both true and false. The false discovery rate (FDR) is the expectation of the ratio {V/R,} where we define the ratio as {0} if {R=0.}

The idea behind this definition is that if there are many discoveries, it might be tolerable to make some false discoveries. However, if all null hypotheses are true, all discoveries are false. So, even a single discovery brings the FDR up to {1} (the largest possible value).

I encourage you to check out Omer Reingold’s Musings on False Discovery Rate from a Computer Scientist’s point of view.

Now, suppose we want to design an experiment while controlling the FDR. That is, we want to make sure that the FDR is at most some {\alpha< 1.} Benjamini and Hochberg suggested an algorithm for doing exactly that. With more than 20,000 citations, this might be one of the most widely cited algorithms:

Let {p_1,\dots,p_m} be {p}-values corresponding to our {m} hypotheses. Sort these {p}-values in increasing order {p_{(1)}\le \dots\le p_{(m)}.}

1. Find the largest {k} such that {p_{(k)} \le \frac{k}{m}\cdot \alpha}.

2. Reject {h_{(1)},\dots,h_{(k)}.}
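
For concreteness, here is a toy implementation of the procedure (my own, with made-up p-values):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure as described above.
    Returns a boolean array marking the rejected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = np.arange(1, m + 1) / m * alpha          # (k/m) * alpha, k = 1, ..., m
    passing = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        k = passing.max()                                  # largest k with p_(k) <= (k/m) * alpha
        reject[order[:k + 1]] = True                       # reject h_(1), ..., h_(k)
    return reject

# Made-up p-values for ten hypotheses.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042,
                          0.060, 0.074, 0.205, 0.212, 0.360]))
```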

An important theorem they proved is that indeed this procedure controls the false discovery rate under certain assumptions. For example, if our test statistics are independent of each other, this holds. Of course, independence is a very strong assumption and a lot of work in statistics aims to weaken these assumptions.

My general feeling about this definition is that it is optimistic in several regards. The first is that we only talk about expectations. It is easy to come up with examples where the expectation is small, but with small yet non-negligible probability there are many false discoveries. Second, we need several fairly strong assumptions to be able to prove that such a procedure actually controls the false discovery rate. Finally, the method puts a lot of faith in careful experimental design and proper execution. In particular, to apply the methodology we need to know ahead of time which hypotheses we might test.

Nevertheless, this optimism in the definition leads to usable and realistic answers as to how large the sample should be in concrete experimental setups. This must have been a major factor in the success of this methodology.

Differential Privacy

How is any of this related to privacy? The short answer is that Differential Privacy by itself controls false discovery in a certain precise sense—though formally different from the above. Indeed, I will formally show below that Differential Privacy implies small “generalization error”.

This is by no means an accident of the definition. There is a deeper reason for it. Intuitively, Differential Privacy ensures that the outcome of a statistical analysis does not depend on the specifics of the data set but rather the underlying properties of the population. This is why it gives privacy. I may be able to learn from the data that smoking causes cancer, but I wouldn’t be able to learn that a particular individual in the database, say Donald Draper, smokes and has cancer. The flip side is that Differential Privacy does not allow you to access the data set arbitrarily often. Instead it attaches a cost to each operation and an overall budget that you better not exceed. The most common complaint about Differential Privacy is that it doesn’t allow you to do whatever the heck you want with the data. My response is usually that neither does statistics. A data set has limited explanatory power. Use it carelessly and you’ll end up with false discovery. I think it’s a feature of Differential Privacy—not a shortcoming—that it makes this fundamental fact of nature explicit.

If this philosophical excursion struck you as unconvincing, let me try the formal route. This argument is folklore by the way. I learned it from Guy Rothblum a few years ago who attributed it to Frank McSherry at the time. Let me know if I’m missing the right attribution.

Suppose we want to learn a concept {c\colon\{0,1\}^d \rightarrow \{0,1\}} from labeled examples {D=\{(x_1,l_1),\dots,(x_m,l_m)\}} drawn from some underlying distribution {P}.

Now, suppose we use an {\epsilon}-differentially private learning algorithm {{\cal A}} for the task. That is, {{\cal A}} is a randomized algorithm such that changing one example has little effect on the output distribution of the algorithm. Formally, for any set of examples {D'} that differs from {D} in only one pair, we have for any set of output hypotheses {H}:

\displaystyle \Pr\{ A(D) \in H \} \ge e^{-\epsilon}\Pr\{ A(D') \in H \}.

Furthermore, assume that {{\cal A}} is {\alpha}-useful in the sense that it always outputs a hypothesis {h} that has error at most {\alpha} on the training examples. Here, {\epsilon,\alpha \ll 1.} (By the way, I’m using the word “hypothesis” here in its learning theory sense not the “null hypothesis” sense.)

I claim that the output of {{\cal A}} must have error at most {O(\epsilon + \alpha)} on the underlying distribution {P.} This means precisely that the output generalizes. Importantly, we didn’t assume anything about the hypothesis class! We only assume that the learner is differentially private.

Proof sketch: Pick a fresh example {(x^*,l^*)} from the distribution {P.} Let {H^*= \{ h \colon h(x^*) = c(x^*) \}} be the set of hypotheses that agree with the concept {c} on this example {x^*.} Let {D'} be the example set where we remove a random element from {D} and replace it with {(x^*,l^*).} Since {{\cal A}} is useful, it must be the case that {h' = {\cal A}(D')} has error at most {\alpha} on {D'}. In particular, it classifies {x^*} correctly with probability {1-\alpha.} This is because all examples in {D'} are identically distributed. Formally, {\Pr\{ {\cal A}(D') \in H^* \}\ge 1-\alpha.} Note the randomness is now over both {{\cal A}} and the replacement we did. Appealing to the definition of Differential Privacy above, it follows that

\displaystyle \Pr\{ A(D) \in H^*\} \ge e^{-\epsilon}(1-\alpha) \ge 1 - O(\epsilon + \alpha).

(The last inequality simply uses {e^{-\epsilon}\ge 1-\epsilon}.) That means {A(D)} is correct on an *unseen* example from {P} with high probability.

The above is just a simple example of a broader phenomenon showing that stability of a learning algorithm implies generalization bounds. Differential Privacy is a strong form of stability that implies most notions of stability studied in learning theory.

What’s next?

False discovery and privacy loss are two persistent issues we’re facing in the era of data analysis. The two problems share some fundamental characteristics. It could be very fruitful to think about both problems in relation to each other, or even as facets of the same underlying problem. Technically, there seems to be some unexplored middle ground between FDR and Differential Privacy that I’d like to understand better.

The discussion here is also closely related to my first post of the semester in which I asked: What should a theory of big data do? I argued that besides posing concrete technical challenges, big data poses a deep conceptual problem. We seem to have difficulty capturing the way people interact with data today. This makes it hard for us to get control over the things that can go wrong. I seem to have come full circle, asking the same question again. Of course, I knew that this was too broad a question to be resolved in the span of four months.

Power Method still Powerful

The Power Method is the oldest practical method for finding the eigenvector of a matrix corresponding to the eigenvalue of largest modulus. Most of us will live to celebrate the Power Method’s 100th birthday around 2029. I plan to throw a party. What’s surprising is that the Power Method continues to be highly relevant in a number of recent applications in algorithms and machine learning. I see primarily two reasons for this. One is computational efficiency. The Power Method happens to be highly efficient both in terms of memory footprint and running time. A few iterations often suffice to find a good approximate solution. The second less appreciated but no less important aspect is robustness. The Power Method is quite well-behaved in the presence of noise, making it an ideal candidate for a number of applications. It’s also easy to modify and to extend it without breaking it. Robust algorithms live longer. The Power Method is a great example. We saw another good one before. (Hint: Gradient Descent) In this post I’ll describe a very useful robust analysis of the Power Method that is not as widely known as it should be. It characterizes the convergence behavior of the algorithm in a strong noise model and comes with a neat geometric intuition.

Power Method: Basics

Let’s start from scratch. The problem that the power method solves is this:

Given {A,} find the rank {1} matrix {B} that minimizes {\|A-B\|_F^2.}

Here, {\|A-B\|_F^2} is the squared Frobenius norm of the difference between {A} and {B.} (Recall that this is the sum of squared entries of {A-B}.) Any rank {1} matrix can of course be written as {xy^T} (allowing for rectangular {A}). On the face of it, this is a non-convex optimization problem and there is no a priori reason that it should be easy! Still, if we think of {f(x,y) = \|A-xy^T\|_F^2} as our objective function, we can check that setting the gradient of {f} with respect to {x} to zero gives us the equation {x=Ay}, provided that we normalize {y} to be a unit vector. Likewise, taking the gradient of {f} with respect to {y} after normalizing {x} to have unit norm leads to the equation {y=A^T x}. The Power Method is the simple algorithm which repeatedly performs these update steps. As an aside, this idea of solving a non-convex low-rank optimization problem via alternating minimization steps is a powerful heuristic for a number of related problems.

For simplicity from here on we’ll assume that {A} is symmetric. In this case there is only one update rule: {x_{t+1} = Ax_{t}.} It is also prudent to normalize {x_t} after each step by dividing the vector by its norm.
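
In code, the symmetric case is just a few lines. Here is a minimal numpy sketch (illustrative only, with an arbitrary test matrix):

```python
import numpy as np

def power_method(A, num_iters=100, rng=None):
    """Power iteration for a symmetric matrix A: repeat x <- A x / ||A x||."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x  # approximate top eigenvector (up to sign)

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = power_method(A)
print(x, x @ A @ x)   # the Rayleigh quotient approximates the dominant eigenvalue
```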

The same derivation of the Power Method applies to the more general problem: Given a symmetric {n\times n} matrix {A,} find the rank {k} matrix {B} that minimizes {\|A-B\|_F^2}. Note that the optimal rank {k} matrix is also given by a truncated singular value decomposition of {A.} Hence, the problem is equivalent to finding the first {k} singular vectors of {A.}

A symmetric rank {k} matrix can be written as {B= XX^T} where {X} is an {n\times k} matrix. The update rule is therefore {X_{t+1} = AX_{t}}. There are multiple ways to normalize {X_t}. The one we choose here is to ensure that {X_t} has orthonormal columns. That is, each column has unit norm and is orthogonal to the other columns. This can be done using good ol’ Gram-Schmidt (though in practice people prefer the Householder method).

The resulting algorithm is often called Subspace Iteration. I like this name. It stresses the point that the only thing that will matter to us is the subspace spanned by the columns of {X_t} and not the particular basis that the algorithm happens to produce. This viewpoint is important in the analysis of the algorithm that we’ll see next.

A Robust Geometric Analysis of Subspace Iteration

Most of us know one analysis of the power method. You write your vector {x_t} in the eigenbasis of {A} and keep track of how matrix vector multiplication manipulates the coefficients of the vector in this basis. It works and it’s simple, but it has a couple of issues. First, it doesn’t generalize easily to the case of Subspace Iteration. Second, it becomes messy in the presence of errors. So what is a better way? What I’m going to describe was possibly done by numerical analysts in the sixties before the community moved on to analyze more modern eigenvalue methods.

Principal Angles

To understand what happens in Subspace Iteration at each step, it’ll be helpful to have a potential function and to understand under which circumstances it decreases. As argued before, the potential function should be basis-free to avoid syntactic arguments. The tool we’re going to use is the principal angle between subspaces. Before we see a formal definition, let me note that much of what’s true for the angle between two vectors can be lifted to the case of subspaces of equal dimension. In fact, much of what’s true for the angle between spaces of equal dimension generalizes to unequal dimensions with some extra care.

The two subspaces we want to compare are the space spanned by the {k} columns of {X_t} and the space {U} spanned by the {r} dominant singular vectors of {A.} For now, we’ll discuss the case where {r=k.} I’ll be a bit sloppy in my notation and use the letters {U,X_t} for these subspaces.

Let’s start with {k=1}. Here the cosine of the angle {\theta} between the two unit vectors {X_t} and {U} is of course defined as {\cos\theta(U,X_t)= |U^T X_t|.} It turns out that the natural generalization for {k>1} is to define

\displaystyle \cos\theta(U,X_t)=\sigma_{\mathrm{min}}(U^T X_t).

In words, the cosine of the angle between {U} and {X_t} is the smallest singular value of the {k\times k} matrix {U^T X_t}. Here’s a sanity check on our definition. If the smallest singular value is {0} then there is a nonzero vector in the range of the matrix {X_t} that is orthogonal to the range of {U}. In an intuitive sense this shows that the subspaces are quite dissimilar. To get some more intuition, consider a basis {V} for the orthogonal complement of {U.} It makes sense to define {\sin\theta(U,X_t)= \sigma_{\mathrm{max}}(V^T X_t)}. Indeed, we can convince ourselves that this satisfies the familiar rule {1= \sin^2\theta + \cos^2\theta}. Finally, we define {\tan\theta} as the ratio of sine to cosine.
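
These quantities are easy to compute numerically. The following sketch (with my own helper names) computes the cosine, sine and tangent for two subspaces given by orthonormal bases and checks that {\sin^2\theta+\cos^2\theta=1}:

```python
import numpy as np

def principal_angle_trig(U, X):
    """Cosine, sine and tangent of the principal angle between the column spans of
    U and X, following the definitions above; both are n-by-k with orthonormal columns."""
    n, k = U.shape
    cos = np.linalg.svd(U.T @ X, compute_uv=False).min()       # sigma_min(U^T X)
    V = np.linalg.svd(U, full_matrices=True)[0][:, k:]          # basis of the orthogonal complement of U
    sin = np.linalg.svd(V.T @ X, compute_uv=False).max()        # sigma_max(V^T X)
    return cos, sin, sin / cos

rng = np.random.default_rng(0)
n, k = 6, 2
U = np.linalg.qr(rng.standard_normal((n, k)))[0]
X = np.linalg.qr(rng.standard_normal((n, k)))[0]
c, s, _ = principal_angle_trig(U, X)
print(c, s, c**2 + s**2)   # the last value should be 1 up to rounding
```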

A strong noise model

Let’s be ambitious and analyze Subspace Iteration in a strong noise model. It turns out that this will not actually make the analysis a lot harder, but a lot more general. The noise model I like is one in which each matrix-vector product is followed by the addition of a possibly adversarial noise term. Specifically, the only way we can access the matrix A is through an operation of the form

\displaystyle Y = AX + G.

Here, {X} is what we get to choose and {G} is the noise matrix we may not get to choose. We assume that G is the only source of error and that arithmetic is exact. In this model our algorithm proceeds as follows:

Pick {X_0.} For {t = 1} to {t=L:}

  1. {Y_t = AX_{t-1} + G_t}
  2. {X_t = \text{Orthonormalize}(Y_t)}
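
Here is a minimal numpy sketch of this noisy variant, with explicit gaussian noise standing in for the adversarial {G_t} (the noise level and test matrix are arbitrary choices of mine):

```python
import numpy as np

def noisy_subspace_iteration(access, n, k, L, rng=None):
    """Subspace iteration in the noise model above. `access(X)` returns A X + G
    for some noise matrix G that we do not control."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random orthonormal X_0
    for _ in range(L):
        Y = access(X)                                  # Y_t = A X_{t-1} + G_t
        X = np.linalg.qr(Y)[0]                         # X_t = Orthonormalize(Y_t)
    return X

# Example: exact products plus explicit gaussian noise standing in for G_t.
rng = np.random.default_rng(1)
n, k = 50, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2     # a symmetric test matrix
X = noisy_subspace_iteration(lambda X: A @ X + 0.01 * rng.standard_normal((n, k)),
                             n, k, L=200, rng=rng)

# Cosine of the principal angle to the true top-k space; it approaches 1 when
# sigma_k / sigma_{k+1} is bounded away from 1 and the noise is small.
evals, evecs = np.linalg.eigh(A)
U = evecs[:, np.argsort(-np.abs(evals))[:k]]
print(np.linalg.svd(U.T @ X, compute_uv=False).min())
```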

The main geometric lemma

To analyze the above algorithm, we will consider the potential function {\tan\theta(U,X_t).} If we pick {X_0} randomly it’s not so hard to show that it starts out being something like {O(\sqrt{kn})} with high probability. We’ll argue that {\tan\theta(U,X_t)} decreases geometrically at the rate of {\sigma_{k+1}/\sigma_{k}} under some assumptions on {G_t}. Here, {\sigma_k} is the {k}-th largest singular value. The analysis therefore requires a separation between {\sigma_k} and {\sigma_{k+1}}. (NB: We can get around that by taking the dimension {r} of {U} to be larger than {k.} In that case we’d get a convergence rate of the form {\sigma_{r+1}/\sigma_k.} The argument is a bit more complicated, but can be found here.)

Now the lemma we’ll prove is this:

Lemma. For every {t\ge 1,}

\displaystyle \tan\theta(U,X_t) \le \frac{\sigma_{k+1}\sin\theta(U,X_{t-1})+\|V^T G_t\|} {\sigma_k\cos\theta(U,X_{t-1})-\|U^T G_t\|}.

This lemma is quite neat. First of all if {G_t=0,} then we recover the noise-free convergence rate of Subspace Iteration that I claimed above. Second what this lemma tells us is that the sine of the angle between {U} and {X_{t-1}} gets perturbed by the norm of the projection of {G_t} onto {V}, the orthogonal complement of {U.} This should not surprise us, since {\sin\theta(U,X_{t-1})=\|V^T X_{t-1}\|.} Similarly, the cosine of the angle is only perturbed by the projection of {G_t} onto {U.} This again seems intuitive given the definition of the cosine. An important consequence is this. Initially, the cosine between {U} and {X_0} might be very small, e.g. {\Omega(\sqrt{k/n})} if we pick {X_0} randomly. Fortunately, the cosine is only affected by a {k}-dimensional projection of {G_t} which we expect to be much smaller than the total norm of {G_t.}
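
It’s also easy to verify the lemma numerically on a random instance. The following sketch (my own, with arbitrary dimensions and noise level) checks one noisy step against the bound; the bound is only meaningful when the denominator is positive, which it is for essentially any reasonable instance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3

# A symmetric test matrix; U spans its top-k singular vectors, V the complement.
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
evals, evecs = np.linalg.eigh(A)
order = np.argsort(-np.abs(evals))
U, V = evecs[:, order[:k]], evecs[:, order[k:]]
sigma = np.abs(evals)[order]                           # sigma_1 >= sigma_2 >= ...

def trig(U, V, X):
    cos = np.linalg.svd(U.T @ X, compute_uv=False).min()
    sin = np.linalg.svd(V.T @ X, compute_uv=False).max()
    return cos, sin, sin / cos

# One noisy iteration starting from a random orthonormal X_{t-1}.
X_prev = np.linalg.qr(rng.standard_normal((n, k)))[0]
G = 0.01 * rng.standard_normal((n, k))
X = np.linalg.qr(A @ X_prev + G)[0]

cos_prev, sin_prev, _ = trig(U, V, X_prev)
_, _, tan_next = trig(U, V, X)
rhs = (sigma[k] * sin_prev + np.linalg.norm(V.T @ G, 2)) / \
      (sigma[k - 1] * cos_prev - np.linalg.norm(U.T @ G, 2))
print(tan_next <= rhs, tan_next, rhs)   # True whenever the denominator is positive
```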

The lemma has an intuitive geometric interpretation expressed in the following picture:

Geometric interpretation of the main lemma

It’s straightforward to prove the lemma when {k=1}. We simply write out the definition on the left hand side, plug in the update rule that generated {X_t} and observe that the normalization terms in the numerator and denominator cancel. This leaves us with the stated expression after some simple manipulations. For larger dimension we might be troubled by the fact that we applied the Gram-Schmidt algorithm. Why would that cancel out? Interestingly it does. Here’s why. The tangent satisfies the identity:

\displaystyle \tan\theta(U,X_t) = \|(V^T X_t)(U^TX_t)^{-1}\|.

It’s perhaps not obvious, but not more than an exercise to check this. On the other hand, {X_t = Y_t R} for some invertible linear transformation {R.} This is a property of orthonormalization. Namely, {X_t} and {Y_t} must have the same range. But {(U^TY_tR)^{-1}=R^{-1}(U^T Y_t)^{-1}}. Hence

\displaystyle \|(V^T X_t)(U^TX_t)^{-1}\| = \|(V^T Y_t)(U^TY_t)^{-1}\|.

It’s now easy to proceed. First, by the submultiplicativity of the spectral norm,

\displaystyle \|(V^T Y_t)(U^TY_t)^{-1}\| \le \|(V^T Y_t)\|\cdot\|(U^TY_t)^{-1}\|.

Consider the first term on the right hand side. We can bound it as follows:

\displaystyle \|V^T Y_t\| \le \|V^T A X_{t-1}\| + \|V^T G_t\| \le \sigma_{k+1}\sin\theta(U,X_{t-1}) + \|V^T G_t\|.

Here we used that the operation {V^T A} eliminates the first {k} singular vectors of {A.} The resulting matrix has spectral norm at most {\sigma_{k+1}}.

The second term follows similarly. We have {\|(U^T Y_t)^{-1}\| = 1/\sigma_{\mathrm{min}}(U^TY_t)} and

\displaystyle \sigma_{\mathrm{min}}(U^TY_{t}) \ge \sigma_{\mathrm{min}}(U^TAX_{t-1}) - \|U^T G_t\| \ge \sigma_k\cos\theta(U,X_{t-1}) - \|U^T G_t\|.

This concludes the proof of the lemma.

Applications and Pointers

It’s been a long post, but it’d be a shame not to mention some applications. There are primarily three applications that sparked my interest in the robustness of the Power Method.

The first application is in privacy-preserving singular vector computation. Here the goal is to compute the dominant singular vectors of a matrix without leaking too much information about individual entries of the matrix. We can formalize this constraint using the notion of Differential Privacy. To get a differentially private version of the Power Method we must add a significant amount of Gaussian noise at each step. To get tight error bounds it’s essential to distinguish between the projections of the noise term onto the space spanned by the top singular vectors and its orthogonal complement. You can see the full analysis here.

Another exciting motivation comes from the area of Matrix Completion. Here, we only see a subsample of the matrix {A} and we wish to compute its dominant singular vectors from this subsample. Alternating Minimization (as briefly mentioned above) is a popular and successful heuristic in this setting, but rather difficult to analyze. Jain, Netrapalli and Sanghavi observed that it can be seen and analyzed as an instance of the noisy power method that we discussed here. The error term in this case arises because we only see a subsample of the matrix and cannot evaluate exact inner products with {A.} These observations lead to rigorous convergence bounds for Alternating Minimization. I recently posted some improvements.

Surprisingly, there’s a fascinating connection between the privacy setting and the matrix completion setting. In both cases the fundamental parameter that controls the error rate is the coherence of the matrix {A}. Arguing about the coherence of various intermediate solutions in the Power Method is a non-trivial issue that arises in both settings and is beyond the scope of this post.

Finally, there are very interesting tensor generalizations of the Power Method and robustness continues to be a key concern. See, for example, here and here. These tensor methods have a number of applications in algorithms and machine learning that are becoming increasingly relevant. While there has been some progress, I think it is fair to say that much remains to be understood in the tensor case. I may decide to work on this once my fear of tensor notation recedes to bearable levels.