"Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions" by Charles Gearhart Frye '09

Doctoral Dissertations

Title

Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions

Author

Charles Gearhart Frye '09, Illinois Mathematics and Science Academy

Date of Award

Spring 2020

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Neuroscience

Degree Program

PhD, Neuroscience

University

University of California, Berkeley

Department

Graduate Division of the University of California, Berkeley

Graduation Date

2020

First Advisor

Associate Professor Michael R DeWeese, Co-chair

Second Advisor

Adjunct Assistant Professor Kristofer E Bouchard, Co-chair

Third Advisor

Professor Bruno A Olshausen

Fourth Advisor

Assistant Professor Moritz Hardt

Keywords

Critical Points, Neural Networks, Loss Functions, Gradient-Flat Points, Nonconvex Optimization, Machine Learning

Subject Categories

Artificial Intelligence and Robotics | Computer Sciences | Physical Sciences and Mathematics

Abstract

Although the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. This makes neural networks easy to train, which, combined with their high representational capacity and implicit and explicit regularization strategies, leads to high-quality machine-learned algorithms at reasonable computational cost in a wide variety of domains.

One thread of work has focused on explaining this phenomenon by numerically characterizing the local curvature at critical points of the loss function, where gradients are zero. Such studies have reported that the loss functions used to train neural networks have no local minima that are much worse than global minima, backed up by arguments from random matrix theory. More recent theoretical work, however, has suggested that bad local minima do exist.
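As a rough illustration of what numerically characterizing local curvature at a critical point can look like, the sketch below classifies a critical point by the signs of the Hessian eigenvalues. It is not taken from the dissertation: the toy loss f(x, y) = x**2 - y**2, the finite-difference Hessian, and the helper names `hessian_fd`, `classify_critical_point`, and `toy_grad` are all assumptions made for this example.

```python
import numpy as np

def hessian_fd(grad, theta, eps=1e-5):
    """Central-difference Hessian built from a gradient function (illustrative helper)."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        H[:, i] = (grad(theta + step) - grad(theta - step)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize away finite-difference noise

def classify_critical_point(grad, theta, tol=1e-6):
    """Return (kind, index); index is the fraction of negative Hessian eigenvalues."""
    eigvals = np.linalg.eigvalsh(hessian_fd(grad, theta))
    n_neg = int(np.sum(eigvals < -tol))
    n_pos = int(np.sum(eigvals > tol))
    index = n_neg / eigvals.size
    if n_pos == eigvals.size:
        kind = "local minimum"
    elif n_neg == eigvals.size:
        kind = "local maximum"
    elif n_neg > 0 and n_pos > 0:
        kind = "saddle"
    else:
        kind = "degenerate"  # some eigenvalues are numerically zero
    return kind, index

# Hypothetical toy loss f(x, y) = x**2 - y**2, with a critical point at the origin.
def toy_grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

kind, index = classify_critical_point(toy_grad, np.zeros(2))
print(kind, index)
```

A classification like this is only meaningful if the point passed in really is a critical point, that is, if the gradient there is (numerically) zero.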

In this dissertation, we show that one cause of this gap is that the methods used to numerically find critical points of neural network losses suffer, ironically, from a bad local minimum problem of their own. This problem is caused by gradient-flat points, where the gradient vector is in the kernel of the Hessian matrix of second partial derivatives. At these points, the loss function becomes, to second order, linear in the direction of the gradient, which violates the assumptions necessary to guarantee convergence for second-order critical point-finding methods. We present evidence that approximately gradient-flat points are a common feature of several prototypical neural network loss functions.
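A small example may help make the definition of a gradient-flat point concrete. The sketch below is not from the dissertation. It assumes a toy two-parameter loss f(x, y) = x + y**2, and it assumes, for illustration only, a critical point-finding method that runs gradient descent on the squared gradient norm (1/2)||g||^2, whose gradient is the Hessian-vector product H g. The abstract does not name the specific methods studied, and the function names and step size here are hypothetical.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy loss f(x, y) = x + y**2."""
    x, y = theta
    return np.array([1.0, 2.0 * y])

def hess(theta):
    """Hessian of the toy loss; constant for this example."""
    return np.array([[0.0, 0.0], [0.0, 2.0]])

def grad_norm_objective(theta):
    """Squared gradient norm (1/2)||g||^2, which is zero only at critical points."""
    g = grad(theta)
    return 0.5 * g @ g

def relative_flatness(theta):
    """||H g|| / ||g||: zero exactly when the gradient lies in the Hessian's kernel."""
    g = grad(theta)
    return np.linalg.norm(hess(theta) @ g) / np.linalg.norm(g)

# Gradient descent on the squared gradient norm; the update direction is H g.
theta = np.array([0.3, 1.0])
for _ in range(200):
    theta = theta - 0.1 * hess(theta) @ grad(theta)

final_grad_norm = np.linalg.norm(grad(theta))
final_flatness = relative_flatness(theta)
print(final_grad_norm, final_flatness)
```

In this toy problem the iteration settles on the line y = 0, where H g vanishes but the gradient norm is 1. The search has reached a minimum of its own objective that is gradient-flat rather than critical, and the toy loss has no critical points at all. Along the gradient direction (1, 0) the loss is exactly linear, which matches the second-order description in the abstract.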

Recommended Citation

Frye, Charles Gearhart '09, "Finding Critical and Gradient-Flat Points of Deep Neural Network Loss Functions" (2020). Doctoral Dissertations. 26.
https://digitalcommons.imsa.edu/alumni_dissertations/26

Included in

Artificial Intelligence and Robotics Commons
