Nonconvex
nonoptimization
Moritz Hardt
Nonconvex machine learning poses two challenges:
Parameterization:
Why and when are larger models easier to train?
Generalization:
Why do huge models transfer well from seen to unseen examples?
Learning setup
Given i.i.d. samples $S=\{z_1,\dots,z_n\}$ from distribution $D$
Goal: Find a "good" model $w$
Risk $R(w)=\mathbb{E}_{z\sim D}\,\ell(w;z)$
Empirical risk $R_S(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w;z_i)$
Generalization error:
$R(w)-R_S(w)$
Fundamental theorem of machine learning
$$R(w)=(R(w)-R_S(w))+R_S(w)$$
Risk =
Generalization error
+
Empirical Risk
"Learning equals optimization plus generalization."
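The identity is trivially true, but keeping the two terms separate is the whole game. A minimal numerical sketch (my toy setup, not from the talk: squared loss $\ell(w;z)=(w-z)^2$ with $z\sim N(0,1)$, so the risk has the closed form $R(w)=w^2+1$):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy setup: z ~ N(0,1), loss l(w; z) = (w - z)^2, so R(w) = w^2 + 1
n = 200
S = rng.standard_normal(n)          # i.i.d. sample from D
w = S.mean()                        # a model fit to the sample

emp_risk = np.mean((w - S) ** 2)    # empirical risk R_S(w)
risk = w ** 2 + 1                   # closed-form risk R(w)
gen_error = risk - emp_risk

# Risk = generalization error + empirical risk, by definition
assert np.isclose(risk, gen_error + emp_risk)
```

Here the generalization error is tiny because the model class is tiny; the interesting regime is huge models, where it need not be.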
Convexico
Learning problems are solved by a natural convex formulation.
Generalization follows from regularization and convex analysis.
Gradientina
Learning problems are solved by (stochastic) gradient methods.
Generalization tied to (implicit) properties of stochastic gradient method.
Optopia
Learning governed by yet-to-be-discovered principles.
Messiland
Learning happens for no good reason.
Deep Learning
Currently arguably in Messiland.
See Rahimi's NIPS keynote.
Can we put it in Convexico?
Unlikely, although don't give up yet...
How about Optopia?
To know oneself is to disbelieve Optopia.
How about Gradientina?
Let's entertain this thought.
Residual networks
Proposed by He, Zhang, Ren, Sun (2015)
Huge empirical success
State-of-the-art for vision tasks
Simple reparameterization
Residual networks
Standard layer: $T(x)$
Residual layer: $R(x)=x+T(x)$
$R\to\mathrm{Id}$ as $T\to 0$
Small $T$, many layers
Typical $T(x)=B(\sigma(Ax))$,
$A,B$ affine, $\sigma$ a nonlinearity.
Why residual layers?
Standard conv nets cannot easily represent the identity map
Trivial for resnets (by setting weights to 0).
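A minimal sketch of this point (hypothetical `residual_layer` helper, with ReLU standing in for $\sigma$ and biases omitted): zeroing the weights of $T$ makes the residual layer exactly the identity, whereas a standard layer with zero weights collapses its input.

```python
import numpy as np

def standard_layer(x, A, B):
    # T(x) = B(sigma(Ax)) with sigma = ReLU; biases omitted for brevity
    return B @ np.maximum(A @ x, 0)

def residual_layer(x, A, B):
    # R(x) = x + T(x)
    return x + standard_layer(x, A, B)

d = 4
x = np.arange(1.0, d + 1)
A = np.zeros((d, d))
B = np.zeros((d, d))

assert np.allclose(standard_layer(x, A, B), 0)   # standard layer destroys x
assert np.allclose(residual_layer(x, A, B), x)   # residual layer is identity
```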
Empirical observation:
Avoids vanishing gradients at greater depth
Can we prove this?
(Recht's) linearization principle
If it's crazy in the linear case,
it's probably also crazy in the nonlinear case.
Use linear case to sanity check your ideas!
Linear residual networks
Linear map: $M=(I+A_k)(I+A_{k-1})\cdots(I+A_1)$
Assume $\|A_i\|\le O(1/k)$.
Can represent any linear $R$ with $\|R\|\le 1$
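A quick sanity check of the representation claim with numpy only, under the extra simplifying assumption (mine, not required by the result) that $R$ is symmetric positive definite: take every factor to be the matrix $k$-th root $R^{1/k}$, so each $\|A_i\|$ shrinks roughly like $\|\log R\|/k$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 100
# Symmetric positive-definite target R with spectral norm < 1
B = rng.standard_normal((d, d))
R = B @ B.T + 0.1 * np.eye(d)
R *= 0.9 / np.linalg.norm(R, 2)

# k-th root via eigendecomposition: I + A_i = R^(1/k)
lam, Q = np.linalg.eigh(R)
root = Q @ np.diag(lam ** (1.0 / k)) @ Q.T
A_i = root - np.eye(d)

assert np.linalg.norm(A_i, 2) < 0.1                    # O(1/k): small
M = np.linalg.matrix_power(root, k)                    # product of k factors
assert np.linalg.norm(M - R, 2) < 1e-8                 # recovers R
```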
Regression problem
$\Phi(A_1,\dots,A_k)=\mathbb{E}\,\|y-Mx\|^2$
No vanishing gradients
Let $B_\tau=\{(A_1,\dots,A_k):\|A_i\|\le\tau\}$.
Theorem. [H-Ma 2017]
Let $\tau<1$. Assume an isotropic input distribution. Then, for any $k$-tuple $A\in B_\tau$, we have
$$\|\nabla\Phi(A)\|_F^2 \ge \Omega\!\left(k\,(1-\tau)^{2k-2}\,(\Phi(A)-\mathrm{OPT})\right)$$
Main idea
Let $E=(I+A_k)\cdots(I+A_1)-R$ be the error matrix.
Observe
$$\frac{\partial\Phi}{\partial A_i}=(I+A_{i+1}^\top)\cdots(I+A_k^\top)\,E\,(I+A_1^\top)\cdots(I+A_{i-1}^\top)$$
Since $\|A_j\|\le\tau<1$, each factor $(I+A_j^\top)$ is invertible!
Hence, the gradient can only vanish if $E=0$.
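A finite-difference check of this gradient formula in toy dimensions (my sketch; with isotropic inputs and $y=Rx$, the objective reduces to $\Phi(A)=\|E\|_F^2$, and the squared norm contributes a factor of 2 that the slide absorbs into constants):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
I = np.eye(d)
A = [0.1 * rng.standard_normal((d, d)) for _ in range(k)]  # small ||A_i||
R = 0.5 * rng.standard_normal((d, d))                      # target map

def product(mats):
    # (I + A_k) ... (I + A_1) for the given list of perturbations
    M = I
    for Ai in mats:
        M = (I + Ai) @ M
    return M

def phi(mats):
    # With isotropic x and y = Rx, the risk is ||M - R||_F^2
    return np.sum((product(mats) - R) ** 2)

def grad(mats, i):
    # 2 * [(I+A_k)...(I+A_{i+1})]^T E [(I+A_{i-1})...(I+A_1)]^T
    E = product(mats) - R
    return 2 * product(mats[i + 1:]).T @ E @ product(mats[:i]).T

# Central finite differences agree with the analytic gradient
i, eps = 1, 1e-5
G = grad(A, i)
for r in range(d):
    for c in range(d):
        Ap = [a.copy() for a in A]; Am = [a.copy() for a in A]
        Ap[i][r, c] += eps; Am[i][r, c] -= eps
        fd = (phi(Ap) - phi(Am)) / (2 * eps)
        assert abs(fd - G[r, c]) < 1e-6
```

Since each factor has smallest singular value at least $1-\tau$, sandwiching $E$ between $k-1$ such factors can shrink it by at most $(1-\tau)^{k-1}$, which (squared) is roughly where the $(1-\tau)^{2k-2}$ in the theorem comes from.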
What about linear networks in standard form?
They do have saddles!
But local minima are global
[Kawaguchi 2016]
— the proof is trickier
How about the real thang, nonlinear residual networks?
Still wide open!
Settle for weaker property than no saddles?
Are saddles even a problem?
- Noise addition defeats saddle points. [Ge-Huang-Jin-Yuan 2015, ...]
- Gradient descent does not converge to strict saddles. [Lee-Simchowitz-Jordan-Recht 2016; see also Pemantle 1990]
Regions of Gradientina
- No saddle points, e.g., quasiconvex
- Saddles exist, but they're all strict
- Flat saddles exist, but all local minima are global.
- Local minima are almost as good as global minima.
- Bad local minima exist, but gradient methods converge to good ones
Ordered by how tricky I think these properties are to show.
Some residents of Gradientina
*under often strong assumptions
- Tensor factorization [Ge-Huang-Jin-Yuan 2015]
- Dictionary learning [Sun-Qu-Wright 2015]
- Phase retrieval [Sun-Qu-Wright 2015]
- Matrix completion [Lee-Ma 2016]
- Linear systems identification [H-Ma-Recht 2016]
Why are local minima good?
[Soudry-Carmon 2016] Attempt to show:
Local minima in over-parameterized neural networks
must have zero training loss.
The assumptions needed make the result possibly vacuous,
but the idea is great!
(Over)parameterization
Model architecture and size ease optimization
A successful optimization theory must explain why
As size increases, are we hurting generalization?
An experiment to keep in mind
- Fit model on data $X,y$
- Pick random labels $\tilde{y}$
- Fit (same) model on $X,\tilde{y}$
Zhang, Bengio, H, Recht, Vinyals (2016)
State-of-the-art neural networks converge to zero training error on a random labeling of the data.
Consistent across
- different architectures,
- multiple data sets,
- various randomization schemes.
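The effect already shows up in a toy linear analogue (an illustration of mine, not the paper's experiment): an over-parameterized least-squares model drives training error to zero on real and on random labels equally easily.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                        # over-parameterized: d >> n
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star)               # labels carrying real signal
y_rand = rng.choice([-1.0, 1.0], n)   # labels carrying none

def min_norm_fit(X, y):
    # Minimum-norm interpolant (what gradient descent finds from w = 0)
    return X.T @ np.linalg.solve(X @ X.T, y)

for labels in (y, y_rand):
    w = min_norm_fit(X, labels)
    train_err = np.mean(np.sign(X @ w) != labels)
    assert train_err == 0.0           # zero training error either way
```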
Fitting noise
Varying noise level
Effect on training time
Effect of regularization
| Data augmentation | Weight decay | Test accuracy |
|-------------------|--------------|---------------|
| yes               | yes          | 89.05%        |
| yes               | no           | 89.31%        |
| no                | yes          | 86.03%        |
| no                | no           | 85.75%        |
Main takeaways
Optimization remains easy even if no generalization occurs
The reasons why deep nets are easy to optimize
must be different from the reasons why they generalize.
Some explicit regularizers are not the reason for generalization
Convergence rates alone don't tell the whole story
Gradient methods as regularization
Few passes over the data implies small generalization error
[H-Recht-Singer 2016]
What notion of complexity do gradient methods minimize?
[Neyshabur-Bhojanapalli-McAllester-Srebro 2017; Neyshabur-Tomioka-Srebro 2014]
Lots and lots of ongoing work on generalization
Conclusion
Moving deep learning into Gradientina is a worthy goal
We're still far from it!
Model size and layout are key to optimization
But it places the burden on generalization