Doctoral Dissertations
Date of Award
Spring 2020
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Neuroscience
Degree Program
PhD, Neuroscience
University
University of California, Berkeley
Department
Graduate Division of the University of California, Berkeley
Graduation Date
2020
First Advisor
Associate Professor Michael R DeWeese, Co-chair
Second Advisor
Adjunct Assistant Professor Kristofer E Bouchard, Co-chair
Third Advisor
Professor Bruno A Olshausen
Fourth Advisor
Assistant Professor Moritz Hardt
Keywords
Critical Points, Neural Networks, Loss Functions, Gradient Flat Points, Nonconvex Optimization, Machine Learning
Subject Categories
Artificial Intelligence and Robotics | Computer Sciences | Physical Sciences and Mathematics
Abstract
Although the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initializations. This ease of training, combined with high representational capacity and implicit and explicit regularization strategies, yields high-quality machine-learned algorithms at reasonable computational cost in a wide variety of domains.
One thread of work has sought to explain this phenomenon by numerically characterizing the local curvature at critical points of the loss function, where gradients are zero. Backed by arguments from random matrix theory, such studies have reported that the loss functions used to train neural networks have no local minima much worse than their global minima. More recent theoretical work, however, suggests that bad local minima do exist.
In this dissertation, we show that one cause of this gap is that the methods used to numerically find critical points of neural network losses suffer, ironically, from a bad local minimum problem of their own. The problem is caused by gradient-flat points, where the gradient vector lies in the kernel of the Hessian matrix of second partial derivatives. At these points, the loss function is, to second order, linear in the direction of the gradient, which violates the assumptions needed to guarantee convergence of second-order critical-point-finding methods. We present evidence that approximately gradient-flat points are a common feature of several prototypical neural network loss functions.
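The gradient-flatness condition described above can be checked numerically: a point is approximately gradient-flat when the Hessian-vector product with the gradient is small relative to the gradient itself. The following is a minimal sketch of such a check, not taken from the dissertation; the toy loss, the use of JAX, and the relative-norm criterion are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the dissertation): test whether the
# gradient of a toy loss lies approximately in the kernel of its Hessian,
# i.e. whether H(theta) @ grad L(theta) is small relative to grad L(theta).
import jax
import jax.numpy as jnp


def loss(theta):
    # Hypothetical toy loss standing in for a neural network loss function.
    return jnp.sum(jnp.tanh(theta) ** 2)


def relative_gradient_flatness(theta):
    g = jax.grad(loss)(theta)  # gradient of the loss at theta
    # Hessian-vector product H(theta) @ g via forward-over-reverse autodiff.
    _, hg = jax.jvp(jax.grad(loss), (theta,), (g,))
    # Small values indicate the gradient is approximately in the Hessian's
    # kernel: an approximately gradient-flat point.
    return jnp.linalg.norm(hg) / jnp.linalg.norm(g)


theta = jnp.array([0.1, -0.2, 0.3])
print(relative_gradient_flatness(theta))
```

At an exactly gradient-flat point this ratio is zero; in practice one would compare it against a chosen tolerance, since the dissertation concerns approximately gradient-flat points encountered by numerical critical-point-finding methods.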
Recommended Citation
Frye, Charles Gearhart '09, "Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions" (2020). Doctoral Dissertations. 26.
https://digitalcommons.imsa.edu/alumni_dissertations/26