The Shallow and the Deep is a collection of lecture notes that offers an accessible introduction to neural networks and machine learning in general. It was clear from the beginning, however, that these notes could not cover this rapidly changing and growing field in its entirety. The focus lies on classical machine learning techniques, with a bias towards classification and regression. Other learning paradigms and many recent developments in, for instance, Deep Learning are not addressed or are only touched upon briefly. Biehl argues that a solid knowledge of the foundations of the field is essential, especially for anyone who wants to explore the world of machine learning with an ambition that goes beyond applying some software package to some data set. The Shallow and the Deep therefore places emphasis on fundamental concepts and their theoretical background. This also involves delving into the history and pre-history of neural networks, where the foundations for most of the recent developments were laid.
These notes aim to demystify machine learning and neural networks without losing appreciation for their impressive power and versatility.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook The Shallow and the Deep for free.
- Title: The Shallow and the Deep
- Subtitle: A biased introduction to neural networks and old school machine learning
- Publisher: University of Groningen Press
- Author(s): Michael Biehl
- Published: 2023-09-27
- Edition: 1
- Format: eBook (pdf, epub, mobi)
- Pages: 290
- Language: English
- ISBN-10: 9403430281
- ISBN-13: 9789403430287
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents

Preface
1 From neurons to networks
  1.1 Spiking neurons and synaptic interactions
  1.2 Firing rate models
    1.2.1 Neural activity and synaptic interaction
    1.2.2 Sigmoidal activation functions
    1.2.3 Hebbian learning
  1.3 Network architectures
    1.3.1 Attractor networks and the Hopfield model
    1.3.2 Feed-forward layered neural networks
    1.3.3 Other architectures
2 Learning from example data
  2.1 Learning scenarios
    2.1.1 Unsupervised learning
    2.1.2 Supervised learning
    2.1.3 Other learning scenarios
  2.2 Machine Learning vs. Statistical Modelling
    2.2.1 Differences and commonalities
    2.2.2 An example case: linear regression
    2.2.3 Conclusion
3 The Perceptron
  3.1 History and literature
  3.2 Linearly separable functions
  3.3 The Rosenblatt perceptron
    3.3.1 The perceptron storage problem
    3.3.2 Iterative Hebbian training algorithms
    3.3.3 The Rosenblatt perceptron algorithm
    3.3.4 The perceptron algorithm as gradient descent
    3.3.5 The Perceptron Convergence Theorem
    3.3.6 A few remarks
  3.4 The capacity of a hyperplane
    3.4.1 The number of linearly separable dichotomies
    3.4.2 Discussion of the result
    3.4.3 Time for a pizza or some cake
  3.5 Learning a linearly separable rule
    3.5.1 Student-teacher scenario
    3.5.2 Learning in version space
    3.5.3 Learning begins where storage ends
    3.5.4 Optimal generalization
  3.6 The perceptron of optimal stability
    3.6.1 The stability criterion
    3.6.2 The MinOver algorithm
  3.7 Optimal stability by quadratic optimization
    3.7.1 Optimal stability reformulated
    3.7.2 The Adaptive Linear Neuron - Adaline
    3.7.3 The Adaptive Perceptron Algorithm - AdaTron
    3.7.4 Support vectors
  3.8 Inhom. lin. sep. functions revisited
  3.9 Some remarks
4 Beyond linear separability
  4.1 Perceptron with errors
    4.1.1 Minimal number of errors
    4.1.2 Soft margin classifier
  4.2 Layered networks of perceptron-like units
    4.2.1 Committee and parity machines
    4.2.2 The parity machine: a universal classifier
    4.2.3 The capacity of machines
  4.3 Support Vector Machines
    4.3.1 Non-linear transformation to higher dimension
    4.3.2 Large Margin classifier
    4.3.3 The kernel trick
    4.3.4 A few remarks
5 Feed-forward networks for regression and classification
  5.1 Feed-forward networks as non-linear function approximators
    5.1.1 Architecture and input–output relation
    5.1.2 Universal approximators
    5.1.3 A network for piecewise constant approximation
    5.1.4 Variants of the Universal Approximation Theorem
  5.2 Gradient based training of feed-forward nets
    5.2.1 Gradient based training: Backpropagation of Error
    5.2.2 Batch gradient descent
    5.2.3 Stochastic gradient descent
    5.2.4 Practical aspects and modifications of SGD
  5.3 Alternative objective functions
    5.3.1 Cost functions for regression
    5.3.2 Cost functions for classification
  5.4 Activation functions
    5.4.1 Sigmoidal and related functions
    5.4.2 One-sided and unbounded activation functions
    5.4.3 Exponential and normalized activations
    5.4.4 Remark: universal function approximation
  5.5 Specific architectures
    5.5.1 Popular shallow networks
    5.5.2 Deep and convolutional neural networks
6 Distance-based classifiers
  6.1 Prototype-based classifiers
    6.1.1 Nearest Neighbor and Nearest Prototype Classifiers
    6.1.2 Learning Vector Quantization
    6.1.3 LVQ training algorithms
  6.2 Distance measures and relevance learning
    6.2.1 LVQ beyond Euclidean distance
    6.2.2 Adaptive distances in relevance learning
  6.3 Concluding remarks
7 Model evaluation and regularization
  7.1 Bias and variance, over- and underfitting
    7.1.1 Decomposition of the error
    7.1.2 The Bias–Variance Dilemma
  7.2 Controlling the network complexity
    7.2.1 Early stopping
    7.2.2 Weight decay and related concepts
    7.2.3 Constructive algorithms
    7.2.4 Pruning
    7.2.5 Weight-sharing
    7.2.6 Dropout
  7.3 Cross-validation and related methods
    7.3.1 n-fold cross-validation and related schemes
    7.3.2 Model and parameter selection
  7.4 Performance measures for regression and classification
    7.4.1 Measures for regression
    7.4.2 Measures for classification
    7.4.3 Receiver Operating Characteristics
    7.4.4 The area under the ROC curve
    7.4.5 Alternative measures for two-class problems
    7.4.6 Multi-class problems
    7.4.7 Averages of class-wise quality measures
  7.5 Interpretable systems
8 Preprocessing and unsupervised learning
  8.1 Normalization and transformations
    8.1.1 Coordinate-wise transformations
    8.1.2 Normalization
  8.2 Dimensionality reduction
    8.2.1 Low-dimensional embedding
    8.2.2 Multi-dimensional Scaling
    8.2.3 Neighborhood Embedding
    8.2.4 Feature selection
  8.3 PCA and related methods
    8.3.1 Principal Component Analysis
    8.3.2 PCA by Hebbian learning
    8.3.3 Independent Component Analysis
  8.4 Clustering and Vector Quantization
    8.4.1 Basic clustering methods
    8.4.2 Competitive learning for Vector Quantization
    8.4.3 Practical issues and extensions of VQ
  8.5 Density estimation
    8.5.1 Parametric density estimation
    8.5.2 Gaussian Mixture Models
  8.6 Missing values and imputation techniques
    8.6.1 Approaches without explicit imputation
    8.6.2 Imputation based on available data
  8.7 Over- and undersampling, augmentation
    8.7.1 Weighted cost functions
    8.7.2 Undersampling
    8.7.3 Oversampling
    8.7.4 Practical issues
    8.7.5 Data augmentation
Concluding quote
A Optimization
  A.1 Multi-dimensional Taylor expansion
  A.2 Local extrema and saddle points
    A.2.1 Necessary and sufficient conditions
    A.2.2 Example: Unsolvable systems of linear equations
  A.3 Constrained optimization
    A.3.1 Equality constraints
    A.3.2 Example: under-determined linear equations
    A.3.3 Inequality constraints
    A.3.4 The Wolfe Dual for convex problems
  A.4 Gradient based optimization
    A.4.1 Gradient and directional derivative
    A.4.2 Gradient descent
    A.4.3 The gradient under coordinate transformations
  A.5 Variants of gradient descent
    A.5.1 Coordinate descent or ascent
    A.5.2 Constrained problems and projected gradients
    A.5.3 Stochastic gradient descent
  A.6 Example calculation of a gradient
Abbrev. and acronyms
Index
References