Non-convex machine learning poses two challenges:
Why and when are larger models easier to train?
Why do huge models generalize well from seen to unseen examples?
Given i.i.d. samples from an unknown distribution
Goal: Find a "good" model, one that predicts well on unseen samples
"Learning equals optimization plus generalization."
Learning problems are solved by a natural convex formulation.
Generalization follows from regularization and convex analysis.
Learning problems are solved by (stochastic) gradient methods.
Generalization is tied to (implicit) properties of the stochastic gradient method.
Learning governed by yet-to-be-discovered principles.
Learning happens for no good reason.
Currently arguably in Messiland. See Rahimi's NIPS keynote.
Can we put it in Convexico? Unlikely, although don't give up yet...
How about Optopia? To know oneself is to disbelieve Optopia.
How about Gradientina? Let's entertain this thought.
Proposed by He, Zhang, Ren, Sun (2015)
Huge empirical success
State-of-the-art for vision tasks
Standard conv nets cannot easily represent the identity map
Trivial for a resnet (set the residual weights to 0).
Avoids vanishing gradients at greater depth
Can we prove this?
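The identity claim can be sanity-checked numerically in the linear case. A minimal sketch (all sizes and weights hypothetical): a standard linear net must make its weight product equal I, while a linear resnet gets the identity by zeroing every residual weight.

```python
import numpy as np

# Hypothetical sanity check in the linear case (all sizes illustrative).
# A standard linear net computes x -> W3 W2 W1 x; representing the identity
# requires the weight product to equal I, a non-trivial joint constraint.
# A linear resnet computes x -> (I + A3)(I + A2)(I + A1) x; setting every
# residual weight A_i = 0 gives the identity for free.
rng = np.random.default_rng(0)
d, depth = 4, 3
Ws = [rng.normal(size=(d, d)) for _ in range(depth)]   # standard layers
As = [np.zeros((d, d)) for _ in range(depth)]          # residual layers

def standard(x, Ws):
    for W in Ws:
        x = W @ x
    return x

def residual(x, As):
    for A in As:
        x = x + A @ x      # one residual layer: (I + A) x
    return x

x = rng.normal(size=d)
print(np.allclose(residual(x, As), x))    # True: zero weights give identity
print(np.allclose(standard(x, Ws), x))    # False for generic weights
```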
Use linear case to sanity check your ideas!
Can represent any linear map
Hence, the gradient can only vanish at a global minimum
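The representation part of the claim in code: a linear residual network (I + A_L)···(I + A_1) can compute an arbitrary linear map R exactly. This sketch crudely puts everything into the first layer (A_1 = R − I, the rest zero); the analysis one actually wants spreads R over many small residual weights.

```python
import numpy as np

# Sketch: a linear residual net (I + A_L)...(I + A_1) representing an
# arbitrary linear map R.  Crude construction: A_1 = R - I, rest zero.
rng = np.random.default_rng(1)
d, depth = 4, 3
R = rng.normal(size=(d, d))              # arbitrary target linear map

As = [R - np.eye(d)] + [np.zeros((d, d)) for _ in range(depth - 1)]

def residual(x, As):
    for A in As:
        x = x + A @ x                    # (I + A) x
    return x

x = rng.normal(size=d)
print(np.allclose(residual(x, As), R @ x))   # True
```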
What about linear networks in standard form? They do have saddles!
But local minima are global [Kawaguchi 2016], though the proof is trickier
How about the real thang, non-linear residual networks?
Still wide open!
Settle for a weaker property than no saddles?
Ordered by how tricky I think these properties are to show.
[Soudry-Carmon 2016] Attempt to show:
Local minima in overparameterized neural networks
must have zero training loss.
The assumptions needed make the result possibly vacuous,
but the idea is great!
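A toy illustration of why zero training loss is attainable at all under heavy overparameterization (an intuition sketch, not the Soudry-Carmon argument itself, and all sizes are hypothetical): with hidden width well above the sample count, the random ReLU feature matrix generically has full row rank, so the output layer alone can already interpolate any labels.

```python
import numpy as np

# With hidden width h >> n samples, the matrix of random ReLU features
# generically has rank n, so the output layer alone can interpolate.
# Sizes are illustrative; this sketches an intuition, not a proof.
rng = np.random.default_rng(0)
n, d, h = 10, 5, 200                      # 10 samples, width 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                    # arbitrary (even random) labels

W = rng.normal(size=(d, h)) / np.sqrt(d)  # random frozen first layer
H = np.maximum(X @ W, 0.0)                # (n, h) ReLU features

# Least-squares fit of the output layer: with rank(H) = n this interpolates.
v, *_ = np.linalg.lstsq(H, y, rcond=None)
train_loss = np.mean((H @ v - y) ** 2)
print(train_loss)                         # ~0 up to numerical precision
```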
Model architecture and size ease optimization
Successful optimization theory must explain why
As size increases, are we hurting generalization?
Zhang, Bengio, H, Recht, Vinyals (2016)
[Table: test accuracy with and without data augmentation and weight decay]
Optimization remains easy even when no generalization is possible (e.g., with random labels)
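A toy version of the randomization test makes the point concrete (all sizes, step counts, and learning rates here are illustrative): train a small two-layer ReLU net by gradient descent on completely random labels; the training loss still drops sharply even though there is nothing to generalize.

```python
import numpy as np

# Gradient descent on random labels: optimization succeeds although the
# labels carry no signal.  Hyperparameters are illustrative, not tuned.
rng = np.random.default_rng(0)
n, d, h = 16, 10, 128
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)       # random labels: no signal

W = rng.normal(size=(d, h)) / np.sqrt(d)
v = rng.normal(size=h) / np.sqrt(h)

def loss(W, v):
    H = np.maximum(X @ W, 0.0)
    return np.mean((H @ v - y) ** 2)

init = loss(W, v)
lr = 0.01
for _ in range(2000):
    H = np.maximum(X @ W, 0.0)
    r = H @ v - y                                      # residuals
    v -= lr * (2 / n) * (H.T @ r)                      # grad wrt v
    W -= lr * (2 / n) * (X.T @ (np.outer(r, v) * (H > 0)))  # grad wrt W
final = loss(W, v)
print(init, final)
```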
Reasons why deep nets are easy to optimize
must be different from why they generalize.
Some explicit regularizers are not the reason for generalization
Convergence rates alone don't tell the whole story
Few passes over the data imply small generalization error
What notion of complexity do gradient methods minimize?
[Neyshabur, Bhojanapalli, McAllester, Srebro 2017; Neyshabur, Tomioka, Srebro 2014]
Lots and lots of ongoing work on generalization
Moving deep learning into Gradientina is a worthy goal
We're still far from it!
Model size and layout are key to optimization
But this places the burden on generalization