This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail, including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms, such as the approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka–Łojasiewicz inequalities), and generalization errors. In the last part of the book, some deep learning approximation methods for partial differential equations (PDEs) are reviewed, including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning and would like to gain a solid foundation, as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning.
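To give a concrete flavor of the central objects treated in Parts I and III, the following short NumPy sketch (our illustration, not code from the book) realizes a fully-connected feedforward ANN with ReLU activation and trains it with plain SGD on a mean squared error loss; the architecture, initialization, learning rate, and toy target function are arbitrary choices made for the example.

```python
import numpy as np

# Minimal sketch (illustrative, not from the book): a fully-connected
# feedforward ANN with layer dimensions l_0, ..., l_L is parametrized by
# weight matrices W_k of shape (l_k, l_{k-1}) and bias vectors b_k of
# shape (l_k,); SGD updates these parameters along negative gradients.

rng = np.random.default_rng(0)
dims = [2, 16, 16, 1]  # arbitrary architecture: 2 -> 16 -> 16 -> 1
params = [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Realization of the ANN on a batch x; also returns caches for backprop."""
    caches = []
    for k, (W, b) in enumerate(params):
        z = x @ W.T + b                                       # affine map
        caches.append((x, z))
        x = np.maximum(z, 0.0) if k < len(params) - 1 else z  # ReLU / identity
    return x, caches

def sgd_step(params, x, y, lr=1e-2):
    """One plain SGD step for the mean squared error loss (manual backprop)."""
    out, caches = forward(params, x)
    delta = 2.0 * (out - y) / x.shape[0]                      # dLoss/dOutput
    grads = []
    for k in reversed(range(len(params))):
        x_in, z = caches[k]
        if k < len(params) - 1:
            delta = delta * (z > 0)                           # ReLU derivative
        grads.append((delta.T @ x_in, delta.sum(axis=0)))
        delta = delta @ params[k][0]                          # pass to layer below
    grads.reverse()
    return [(W - lr * gW, b - lr * gb)
            for (W, b), (gW, gb) in zip(params, grads)]

# Toy usage: regress the (arbitrary) target f(x) = x_1 * x_2.
x = rng.uniform(-1.0, 1.0, size=(128, 2))
y = x[:, :1] * x[:, 1:]
for _ in range(2000):
    params = sgd_step(params, x, y)
print("training MSE:", float(np.mean((forward(params, x)[0] - y) ** 2)))
```

In practice one would rely on an automatic differentiation framework rather than hand-coded backpropagation; the manual version is shown here only because backpropagation and SGD are among the methods derived in full detail in Part III of the book.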
Conditions of Use
This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. The eBook Mathematical Introduction to Deep Learning can be downloaded for free.
- Title: Mathematical Introduction to Deep Learning
- Subtitle: Methods, Implementations, and Theory
- Publisher: arxiv.org
- Author(s): Arnulf Jentzen, Benno Kuckuck, Philippe von Wurstemberger
- Published: 2023-10-31
- Edition: 1
- Format: eBook (PDF, EPUB, MOBI)
- Pages: 601
- Language: English
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents

- Preface
- Introduction
- Part I: Artificial neural networks (ANNs)
  - Basics on ANNs
    - Fully-connected feedforward ANNs (vectorized description)
      - Affine functions
      - Vectorized description of fully-connected feedforward ANNs
      - Weight and bias parameters of fully-connected feedforward ANNs
    - Activation functions
      - Multidimensional versions
      - Single hidden layer fully-connected feedforward ANNs
      - Rectified linear unit (ReLU) activation
      - Clipping activation
      - Softplus activation
      - Gaussian error linear unit (GELU) activation
      - Standard logistic activation
      - Swish activation
      - Hyperbolic tangent activation
      - Softsign activation
      - Leaky rectified linear unit (leaky ReLU) activation
      - Exponential linear unit (ELU) activation
      - Rectified power unit (RePU) activation
      - Sine activation
      - Heaviside activation
      - Softmax activation
    - Fully-connected feedforward ANNs (structured description)
      - Structured description of fully-connected feedforward ANNs
      - Realizations of fully-connected feedforward ANNs
      - On the connection to the vectorized description
    - Convolutional ANNs (CNNs)
      - Discrete convolutions
      - Structured description of feedforward CNNs
      - Realizations of feedforward CNNs
    - Residual ANNs (ResNets)
      - Structured description of fully-connected ResNets
      - Realizations of fully-connected ResNets
    - Recurrent ANNs (RNNs)
      - Description of RNNs
      - Vectorized description of simple fully-connected RNNs
      - Long short-term memory (LSTM) RNNs
    - Further types of ANNs
      - ANNs with encoder-decoder architectures: autoencoders
      - Transformers and the attention mechanism
      - Graph neural networks (GNNs)
      - Neural operators
  - ANN calculus
    - Compositions of fully-connected feedforward ANNs
      - Compositions of fully-connected feedforward ANNs
      - Elementary properties of compositions of fully-connected feedforward ANNs
      - Associativity of compositions of fully-connected feedforward ANNs
      - Powers of fully-connected feedforward ANNs
    - Parallelizations of fully-connected feedforward ANNs
      - Parallelizations of fully-connected feedforward ANNs with the same length
      - Representations of the identities with ReLU activation functions
      - Extensions of fully-connected feedforward ANNs
      - Parallelizations of fully-connected feedforward ANNs with different lengths
    - Scalar multiplications of fully-connected feedforward ANNs
      - Affine transformations as fully-connected feedforward ANNs
      - Scalar multiplications of fully-connected feedforward ANNs
    - Sums of fully-connected feedforward ANNs with the same length
      - Sums of vectors as fully-connected feedforward ANNs
      - Concatenation of vectors as fully-connected feedforward ANNs
      - Sums of fully-connected feedforward ANNs
- Part II: Approximation
  - One-dimensional ANN approximation results
    - Linear interpolation of one-dimensional functions
      - On the modulus of continuity
      - Linear interpolation of one-dimensional functions
    - Linear interpolation with fully-connected feedforward ANNs
      - Activation functions as fully-connected feedforward ANNs
      - Representations for ReLU ANNs with one hidden neuron
      - ReLU ANN representations for linear interpolations
    - ANN approximation results for one-dimensional functions
      - Constructive ANN approximation results
      - Convergence rates for the approximation error
  - Multi-dimensional ANN approximation results
    - Approximations through supremal convolutions
    - ANN representations
      - ANN representations for the 1-norm
      - ANN representations for maxima
      - ANN representations for maximum convolutions
    - ANN approximation results for multi-dimensional functions
      - Constructive ANN approximation results
      - Covering number estimates
      - Convergence rates for the approximation error
    - Refined ANN approximation results for multi-dimensional functions
      - Rectified clipped ANNs
      - Embedding ANNs in larger architectures
      - Approximation through ANNs with variable architectures
      - Refined convergence rates for the approximation error
- Part III: Optimization
  - Optimization through gradient flow (GF) trajectories
    - Introductory comments for the training of ANNs
    - Basics for GFs
      - GF ordinary differential equations (ODEs)
      - Direction of negative gradients
    - Regularity properties for ANNs
      - On the differentiability of compositions of parametric functions
      - On the differentiability of realizations of ANNs
    - Loss functions
      - Absolute error loss
      - Mean squared error loss
      - Huber error loss
      - Cross-entropy loss
      - Kullback–Leibler divergence loss
    - GF optimization in the training of ANNs
    - Lyapunov-type functions for GFs
      - Gronwall differential inequalities
      - Lyapunov-type functions for ODEs
      - On Lyapunov-type functions and coercivity-type conditions
      - Sufficient and necessary conditions for local minimum points
      - On a linear growth condition
    - Optimization through flows of ODEs
      - Approximation of local minimum points through GFs
      - Existence and uniqueness of solutions of ODEs
      - Approximation of local minimum points through GFs revisited
      - Approximation error with respect to the objective function
  - Deterministic gradient descent (GD) optimization methods
    - GD optimization
      - GD optimization in the training of ANNs
      - Euler discretizations for GF ODEs
      - Lyapunov-type stability for GD optimization
      - Error analysis for GD optimization
    - Explicit midpoint GD optimization
      - Explicit midpoint discretizations for GF ODEs
    - GD optimization with classical momentum
      - Representations for GD optimization with momentum
      - Bias-adjusted GD optimization with momentum
      - Error analysis for GD optimization with momentum
      - Numerical comparisons for GD optimization with and without momentum
    - GD optimization with Nesterov momentum
    - Adagrad GD optimization (Adagrad)
    - Root mean square propagation GD optimization (RMSprop)
      - Representations of the mean square terms in RMSprop
      - Bias-adjusted root mean square propagation GD optimization
    - Adadelta GD optimization
    - Adaptive moment estimation GD optimization (Adam)
  - Stochastic gradient descent (SGD) optimization methods
    - Introductory comments for the training of ANNs with SGD
    - SGD optimization
      - SGD optimization in the training of ANNs
      - Non-convergence of SGD for not appropriately decaying learning rates
      - Convergence rates for SGD for quadratic objective functions
      - Convergence rates for SGD for coercive objective functions
    - Explicit midpoint SGD optimization
    - SGD optimization with classical momentum
      - Bias-adjusted SGD optimization with classical momentum
    - SGD optimization with Nesterov momentum
      - Simplified SGD optimization with Nesterov momentum
    - Adagrad SGD optimization (Adagrad)
    - Root mean square propagation SGD optimization (RMSprop)
      - Bias-adjusted root mean square propagation SGD optimization
    - Adadelta SGD optimization
    - Adaptive moment estimation SGD optimization (Adam)
  - Backpropagation
    - Backpropagation for parametric functions
    - Backpropagation for ANNs
  - Kurdyka–Łojasiewicz (KL) inequalities
    - Standard KL functions
    - Convergence analysis using standard KL functions (regular regime)
    - Standard KL inequalities for monomials
    - Standard KL inequalities around non-critical points
    - Standard KL inequalities with increased exponents
    - Standard KL inequalities for one-dimensional polynomials
    - Power series and analytic functions
    - Standard KL inequalities for one-dimensional analytic functions
    - Standard KL inequalities for analytic functions
    - Counterexamples
    - Convergence analysis for solutions of GF ODEs
      - Abstract local convergence results for GF processes
      - Abstract global convergence results for GF processes
    - Convergence analysis for GD processes
      - One-step descent property for GD processes
      - Abstract local convergence results for GD processes
    - On the analyticity of realization functions of ANNs
    - Standard KL inequalities for empirical risks in the training of ANNs with analytic activation functions
    - Fréchet subdifferentials and limiting Fréchet subdifferentials
    - Non-smooth slope
    - Generalized KL functions
  - ANNs with batch normalization
    - Batch normalization (BN)
      - Structured description of fully-connected feedforward ANNs with BN (training)
      - Realizations of fully-connected feedforward ANNs with BN (training)
      - Structured description of fully-connected feedforward ANNs with BN (inference)
      - Realizations of fully-connected feedforward ANNs with BN (inference)
      - On the connection between BN for training and BN for inference
  - Optimization through random initializations
    - Analysis of the optimization error
      - The complementary distribution function formula
      - Estimates for the optimization error involving complementary distribution functions
    - Strong convergence rates for the optimization error
      - Properties of the gamma and the beta function
      - Product measurability of continuous random fields
      - Strong convergence rates for the optimization error
    - Strong convergence rates for the optimization error involving ANNs
      - Local Lipschitz continuity estimates for the parametrization functions of ANNs
      - Strong convergence rates for the optimization error involving ANNs
- Part IV: Generalization
  - Probabilistic generalization error estimates
    - Concentration inequalities for random variables
      - Markov's inequality
      - A first concentration inequality
      - Moment-generating functions
      - Chernoff bounds
      - Hoeffding's inequality
      - A strengthened Hoeffding's inequality
    - Covering number estimates
      - Entropy quantities
      - Inequalities for packing entropy quantities in metric spaces
      - Inequalities for covering entropy quantities in metric spaces
      - Inequalities for entropy quantities in finite-dimensional vector spaces
    - Empirical risk minimization
      - Concentration inequalities for random fields
      - Uniform estimates for the statistical learning error
  - Strong generalization error estimates
    - Monte Carlo estimates
    - Uniform strong error estimates for random fields
    - Strong convergence rates for the generalization error
- Part V: Composed error analysis
  - Overall error decomposition
    - Bias-variance decomposition
    - Risk minimization for measurable functions
    - Overall error decomposition
  - Composed error estimates
    - Full strong error analysis for the training of ANNs
    - Full strong error analysis with optimization via SGD with random initializations
- Part VI: Deep learning for partial differential equations (PDEs)
  - Physics-informed neural networks (PINNs)
    - Reformulation of PDE problems as stochastic optimization problems
    - Derivation of PINNs and deep Galerkin methods (DGMs)
    - Implementation of PINNs
    - Implementation of DGMs
  - Deep Kolmogorov methods (DKMs)
    - Stochastic optimization problems for expectations of random variables
    - Stochastic optimization problems for expectations of random fields
    - Feynman–Kac formulas
      - Feynman–Kac formulas providing existence of solutions
      - Feynman–Kac formulas providing uniqueness of solutions
    - Reformulation of PDE problems as stochastic optimization problems
    - Derivation of DKMs
    - Implementation of DKMs
  - Further deep learning methods for PDEs
    - Deep learning methods based on strong formulations of PDEs
    - Deep learning methods based on weak formulations of PDEs
    - Deep learning methods based on stochastic representations of PDEs
    - Error analyses for deep learning methods for PDEs
- Index of abbreviations
- List of figures
- List of source codes
- List of definitions
- Bibliography