This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail, including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms, such as the approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka–Łojasiewicz inequalities), and generalization errors. In the last part of the book, some deep learning approximation methods for partial differential equations (PDEs) are reviewed, including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning and would like to gain a solid foundation, as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning.
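To give a concrete flavor of the central objects treated in Parts I and III, the following short NumPy sketch (our illustration, not code from the book) realizes a fully-connected feedforward ANN with ReLU activation and trains it with plain SGD on a mean squared error loss; the architecture, initialization, learning rate, and toy target function are arbitrary choices made for the example.

```python
import numpy as np

# Minimal sketch (illustrative, not from the book): a fully-connected
# feedforward ANN with layer dimensions l_0, ..., l_L is parametrized by
# weight matrices W_k of shape (l_k, l_{k-1}) and bias vectors b_k of
# shape (l_k,); SGD updates these parameters along negative gradients.

rng = np.random.default_rng(0)
dims = [2, 16, 16, 1]  # arbitrary architecture: 2 -> 16 -> 16 -> 1
params = [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Realization of the ANN on a batch x; also returns caches for backprop."""
    caches = []
    for k, (W, b) in enumerate(params):
        z = x @ W.T + b                                       # affine map
        caches.append((x, z))
        x = np.maximum(z, 0.0) if k < len(params) - 1 else z  # ReLU / identity
    return x, caches

def sgd_step(params, x, y, lr=1e-2):
    """One plain SGD step for the mean squared error loss (manual backprop)."""
    out, caches = forward(params, x)
    delta = 2.0 * (out - y) / x.shape[0]                      # dLoss/dOutput
    grads = []
    for k in reversed(range(len(params))):
        x_in, z = caches[k]
        if k < len(params) - 1:
            delta = delta * (z > 0)                           # ReLU derivative
        grads.append((delta.T @ x_in, delta.sum(axis=0)))
        delta = delta @ params[k][0]                          # pass to layer below
    grads.reverse()
    return [(W - lr * gW, b - lr * gb)
            for (W, b), (gW, gb) in zip(params, grads)]

# Toy usage: regress the (arbitrary) target f(x) = x_1 * x_2.
x = rng.uniform(-1.0, 1.0, size=(128, 2))
y = x[:, :1] * x[:, 1:]
for _ in range(2000):
    params = sgd_step(params, x, y)
print("training MSE:", float(np.mean((forward(params, x)[0] - y) ** 2)))
```

In practice one would rely on an automatic differentiation framework rather than hand-coded backpropagation; the manual version is shown here only because backpropagation and SGD are among the methods derived in full detail in Part III of the book.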
Conditions of Use
This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. The eBook Mathematical Introduction to Deep Learning can be downloaded for free.
- Title: Mathematical Introduction to Deep Learning
- Subtitle: Methods, Implementations, and Theory
- Publisher: arxiv.org
- Author(s): Arnulf Jentzen, Benno Kuckuck, Philippe von Wurstemberger
- Published: 2023-10-31
- Edition: 1
- Format: eBook (PDF, EPUB, MOBI)
- Pages: 601
- Language: English
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents

- Preface
- Introduction
- Part I: Artificial neural networks (ANNs)
  - Basics on ANNs
    - Fully-connected feedforward ANNs (vectorized description)
      - Affine functions
      - Vectorized description of fully-connected feedforward ANNs
      - Weight and bias parameters of fully-connected feedforward ANNs
    - Activation functions
      - Multidimensional versions
      - Single hidden layer fully-connected feedforward ANNs
      - Rectified linear unit (ReLU) activation
      - Clipping activation
      - Softplus activation
      - Gaussian error linear unit (GELU) activation
      - Standard logistic activation
      - Swish activation
      - Hyperbolic tangent activation
      - Softsign activation
      - Leaky rectified linear unit (leaky ReLU) activation
      - Exponential linear unit (ELU) activation
      - Rectified power unit (RePU) activation
      - Sine activation
      - Heaviside activation
      - Softmax activation
    - Fully-connected feedforward ANNs (structured description)
      - Structured description of fully-connected feedforward ANNs
      - Realizations of fully-connected feedforward ANNs
      - On the connection to the vectorized description
    - Convolutional ANNs (CNNs)
      - Discrete convolutions
      - Structured description of feedforward CNNs
      - Realizations of feedforward CNNs
    - Residual ANNs (ResNets)
      - Structured description of fully-connected ResNets
      - Realizations of fully-connected ResNets
    - Recurrent ANNs (RNNs)
      - Description of RNNs
      - Vectorized description of simple fully-connected RNNs
      - Long short-term memory (LSTM) RNNs
    - Further types of ANNs
      - ANNs with encoder-decoder architectures: autoencoders
      - Transformers and the attention mechanism
      - Graph neural networks (GNNs)
      - Neural operators
  - ANN calculus
    - Compositions of fully-connected feedforward ANNs
      - Compositions of fully-connected feedforward ANNs
      - Elementary properties of compositions of fully-connected feedforward ANNs
      - Associativity of compositions of fully-connected feedforward ANNs
      - Powers of fully-connected feedforward ANNs
    - Parallelizations of fully-connected feedforward ANNs
      - Parallelizations of fully-connected feedforward ANNs with the same length
      - Representations of the identities with ReLU activation functions
      - Extensions of fully-connected feedforward ANNs
      - Parallelizations of fully-connected feedforward ANNs with different lengths
    - Scalar multiplications of fully-connected feedforward ANNs
      - Affine transformations as fully-connected feedforward ANNs
      - Scalar multiplications of fully-connected feedforward ANNs
    - Sums of fully-connected feedforward ANNs with the same length
      - Sums of vectors as fully-connected feedforward ANNs
      - Concatenation of vectors as fully-connected feedforward ANNs
      - Sums of fully-connected feedforward ANNs
- Part II: Approximation
  - One-dimensional ANN approximation results
    - Linear interpolation of one-dimensional functions
      - On the modulus of continuity
      - Linear interpolation of one-dimensional functions
    - Linear interpolation with fully-connected feedforward ANNs
      - Activation functions as fully-connected feedforward ANNs
      - Representations for ReLU ANNs with one hidden neuron
      - ReLU ANN representations for linear interpolations
    - ANN approximation results for one-dimensional functions
      - Constructive ANN approximation results
      - Convergence rates for the approximation error
  - Multi-dimensional ANN approximation results
    - Approximations through supremal convolutions
    - ANN representations
      - ANN representations for the 1-norm
      - ANN representations for maxima
      - ANN representations for maximum convolutions
    - ANN approximation results for multi-dimensional functions
      - Constructive ANN approximation results
      - Covering number estimates
      - Convergence rates for the approximation error
    - Refined ANN approximation results for multi-dimensional functions
      - Rectified clipped ANNs
      - Embedding ANNs in larger architectures
      - Approximation through ANNs with variable architectures
      - Refined convergence rates for the approximation error
- Part III: Optimization
  - Optimization through gradient flow (GF) trajectories
    - Introductory comments for the training of ANNs
    - Basics for GFs
      - GF ordinary differential equations (ODEs)
      - Direction of negative gradients
    - Regularity properties for ANNs
      - On the differentiability of compositions of parametric functions
      - On the differentiability of realizations of ANNs
    - Loss functions
      - Absolute error loss
      - Mean squared error loss
      - Huber error loss
      - Cross-entropy loss
      - Kullback–Leibler divergence loss
    - GF optimization in the training of ANNs
    - Lyapunov-type functions for GFs
      - Gronwall differential inequalities
      - Lyapunov-type functions for ODEs
      - On Lyapunov-type functions and coercivity-type conditions
      - Sufficient and necessary conditions for local minimum points
      - On a linear growth condition
    - Optimization through flows of ODEs
      - Approximation of local minimum points through GFs
      - Existence and uniqueness of solutions of ODEs
      - Approximation of local minimum points through GFs revisited
      - Approximation error with respect to the objective function
  - Deterministic gradient descent (GD) optimization methods
    - GD optimization
      - GD optimization in the training of ANNs
      - Euler discretizations for GF ODEs
      - Lyapunov-type stability for GD optimization
      - Error analysis for GD optimization
    - Explicit midpoint GD optimization
      - Explicit midpoint discretizations for GF ODEs
    - GD optimization with classical momentum
      - Representations for GD optimization with momentum
      - Bias-adjusted GD optimization with momentum
      - Error analysis for GD optimization with momentum
      - Numerical comparisons for GD optimization with and without momentum
    - GD optimization with Nesterov momentum
    - Adagrad GD optimization (Adagrad)
    - Root mean square propagation GD optimization (RMSprop)
      - Representations of the mean square terms in RMSprop
      - Bias-adjusted root mean square propagation GD optimization
    - Adadelta GD optimization
    - Adaptive moment estimation GD optimization (Adam)
  - Stochastic gradient descent (SGD) optimization methods
    - Introductory comments for the training of ANNs with SGD
    - SGD optimization
      - SGD optimization in the training of ANNs
      - Non-convergence of SGD for not appropriately decaying learning rates
      - Convergence rates for SGD for quadratic objective functions
      - Convergence rates for SGD for coercive objective functions
    - Explicit midpoint SGD optimization
    - SGD optimization with classical momentum
      - Bias-adjusted SGD optimization with classical momentum
    - SGD optimization with Nesterov momentum
      - Simplified SGD optimization with Nesterov momentum
    - Adagrad SGD optimization (Adagrad)
    - Root mean square propagation SGD optimization (RMSprop)
      - Bias-adjusted root mean square propagation SGD optimization
    - Adadelta SGD optimization
    - Adaptive moment estimation SGD optimization (Adam)
  - Backpropagation
    - Backpropagation for parametric functions
    - Backpropagation for ANNs
  - Kurdyka–Łojasiewicz (KL) inequalities
    - Standard KL functions
    - Convergence analysis using standard KL functions (regular regime)
    - Standard KL inequalities for monomials
    - Standard KL inequalities around non-critical points
    - Standard KL inequalities with increased exponents
    - Standard KL inequalities for one-dimensional polynomials
    - Power series and analytic functions
    - Standard KL inequalities for one-dimensional analytic functions
    - Standard KL inequalities for analytic functions
    - Counterexamples
    - Convergence analysis for solutions of GF ODEs
      - Abstract local convergence results for GF processes
      - Abstract global convergence results for GF processes
    - Convergence analysis for GD processes
      - One-step descent property for GD processes
      - Abstract local convergence results for GD processes
    - On the analyticity of realization functions of ANNs
    - Standard KL inequalities for empirical risks in the training of ANNs with analytic activation functions
    - Fréchet subdifferentials and limiting Fréchet subdifferentials
    - Non-smooth slope
    - Generalized KL functions
  - ANNs with batch normalization
    - Batch normalization (BN)
      - Structured description of fully-connected feedforward ANNs with BN (training)
      - Realizations of fully-connected feedforward ANNs with BN (training)
      - Structured description of fully-connected feedforward ANNs with BN (inference)
      - Realizations of fully-connected feedforward ANNs with BN (inference)
      - On the connection between BN for training and BN for inference
  - Optimization through random initializations
    - Analysis of the optimization error
      - The complementary distribution function formula
      - Estimates for the optimization error involving complementary distribution functions
    - Strong convergence rates for the optimization error
      - Properties of the gamma and the beta function
      - Product measurability of continuous random fields
      - Strong convergence rates for the optimization error
    - Strong convergence rates for the optimization error involving ANNs
      - Local Lipschitz continuity estimates for the parametrization functions of ANNs
      - Strong convergence rates for the optimization error involving ANNs
- Part IV: Generalization
  - Probabilistic generalization error estimates
    - Concentration inequalities for random variables
      - Markov's inequality
      - A first concentration inequality
      - Moment-generating functions
      - Chernoff bounds
      - Hoeffding's inequality
      - A strengthened Hoeffding's inequality
    - Covering number estimates
      - Entropy quantities
      - Inequalities for packing entropy quantities in metric spaces
      - Inequalities for covering entropy quantities in metric spaces
      - Inequalities for entropy quantities in finite-dimensional vector spaces
    - Empirical risk minimization
      - Concentration inequalities for random fields
      - Uniform estimates for the statistical learning error
  - Strong generalization error estimates
    - Monte Carlo estimates
    - Uniform strong error estimates for random fields
    - Strong convergence rates for the generalization error
- Part V: Composed error analysis
  - Overall error decomposition
    - Bias-variance decomposition
    - Risk minimization for measurable functions
    - Overall error decomposition
  - Composed error estimates
    - Full strong error analysis for the training of ANNs
    - Full strong error analysis with optimization via SGD with random initializations
- Part VI: Deep learning for partial differential equations (PDEs)
  - Physics-informed neural networks (PINNs)
    - Reformulation of PDE problems as stochastic optimization problems
    - Derivation of PINNs and deep Galerkin methods (DGMs)
    - Implementation of PINNs
    - Implementation of DGMs
  - Deep Kolmogorov methods (DKMs)
    - Stochastic optimization problems for expectations of random variables
    - Stochastic optimization problems for expectations of random fields
    - Feynman–Kac formulas
      - Feynman–Kac formulas providing existence of solutions
      - Feynman–Kac formulas providing uniqueness of solutions
    - Reformulation of PDE problems as stochastic optimization problems
    - Derivation of DKMs
    - Implementation of DKMs
  - Further deep learning methods for PDEs
    - Deep learning methods based on strong formulations of PDEs
    - Deep learning methods based on weak formulations of PDEs
    - Deep learning methods based on stochastic representations of PDEs
    - Error analyses for deep learning methods for PDEs
- Index of abbreviations
- List of figures
- List of source codes
- List of definitions
- Bibliography