Probabilistic Machine Learning

A detailed and up-to-date introduction to machine learning, presented through the unifying lens of probabilistic modeling and Bayesian decision theory.
This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic modeling and Bayesian decision theory. The book covers mathematical background (including linear algebra and optimization), basic supervised learning (including linear and logistic regression and deep neural networks), as well as more advanced topics (including transfer learning and unsupervised learning). End-of-chapter exercises allow students to apply what they have learned, and an appendix covers notation.
Probabilistic Machine Learning grew out of the author’s 2012 book, Machine Learning: A Probabilistic Perspective. More than just a simple update, this is a completely new book that reflects the dramatic developments in the field since 2012, most notably deep learning. In addition, the new book is accompanied by online Python code, using libraries such as scikit-learn, JAX, PyTorch, and Tensorflow, which can be used to reproduce nearly all the figures; this code can be run inside a web browser using cloud-based notebooks, and provides a practical complement to the theoretical topics discussed in the book. This introductory text will be followed by a sequel that covers more advanced topics, taking the same probabilistic approach.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Probabilistic Machine Learning for free.
Title: Probabilistic Machine Learning
Subtitle: An Introduction
Publisher: The MIT Press
Author(s): Kevin P. Murphy
Published: 2022-02-01
Edition: 1
Format: eBook (pdf, epub, mobi)
Pages: 944
Language: English
ISBN-10: 0262046822
ISBN-13: 9780262046824
License: CC BY-NC-SA
Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Copyright
Preface
1 Introduction
    1.1 What is machine learning?
    1.2 Supervised learning
        1.2.1 Classification
        1.2.2 Regression
        1.2.3 Overfitting and generalization
        1.2.4 No free lunch theorem
    1.3 Unsupervised learning
        1.3.1 Clustering
        1.3.2 Discovering latent “factors of variation”
        1.3.3 Self-supervised learning
        1.3.4 Evaluating unsupervised learning
    1.4 Reinforcement learning
    1.5 Data
        1.5.1 Some common image datasets
        1.5.2 Some common text datasets
        1.5.3 Preprocessing discrete input data
        1.5.4 Preprocessing text data
        1.5.5 Handling missing data
    1.6 Discussion
        1.6.1 The relationship between ML and other fields
        1.6.2 Structure of the book
        1.6.3 Caveats
I Foundations
    2 Probability: Univariate Models
        2.1 Introduction
            2.1.1 What is probability?
            2.1.2 Types of uncertainty
            2.1.3 Probability as an extension of logic
        2.2 Random variables
            2.2.1 Discrete random variables
            2.2.2 Continuous random variables
            2.2.3 Sets of related random variables
            2.2.4 Independence and conditional independence
            2.2.5 Moments of a distribution
            2.2.6 Limitations of summary statistics *
        2.3 Bayes’ rule
            2.3.1 Example: Testing for COVID-19
            2.3.2 Example: The Monty Hall problem
            2.3.3 Inverse problems *
        2.4 Bernoulli and binomial distributions
            2.4.1 Definition
            2.4.2 Sigmoid (logistic) function
            2.4.3 Binary logistic regression
        2.5 Categorical and multinomial distributions
            2.5.1 Definition
            2.5.2 Softmax function
            2.5.3 Multiclass logistic regression
            2.5.4 Log-sum-exp trick
        2.6 Univariate Gaussian (normal) distribution
            2.6.1 Cumulative distribution function
            2.6.2 Probability density function
            2.6.3 Regression
            2.6.4 Why is the Gaussian distribution so widely used?
            2.6.5 Dirac delta function as a limiting case
        2.7 Some other common univariate distributions *
            2.7.1 Student t distribution
            2.7.2 Cauchy distribution
            2.7.3 Laplace distribution
            2.7.4 Beta distribution
            2.7.5 Gamma distribution
            2.7.6 Empirical distribution
        2.8 Transformations of random variables *
            2.8.1 Discrete case
            2.8.2 Continuous case
            2.8.3 Invertible transformations (bijections)
            2.8.4 Moments of a linear transformation
            2.8.5 The convolution theorem
            2.8.6 Central limit theorem
            2.8.7 Monte Carlo approximation
        2.9 Exercises
    3 Probability: Multivariate Models
        3.1 Joint distributions for multiple random variables
            3.1.1 Covariance
            3.1.2 Correlation
            3.1.3 Uncorrelated does not imply independent
            3.1.4 Correlation does not imply causation
            3.1.5 Simpson’s paradox
        3.2 The multivariate Gaussian (normal) distribution
            3.2.1 Definition
            3.2.2 Mahalanobis distance
            3.2.3 Marginals and conditionals of an MVN *
            3.2.4 Example: conditioning a 2d Gaussian
            3.2.5 Example: Imputing missing values *
        3.3 Linear Gaussian systems *
            3.3.1 Bayes rule for Gaussians
            3.3.2 Derivation *
            3.3.3 Example: Inferring an unknown scalar
            3.3.4 Example: inferring an unknown vector
            3.3.5 Example: sensor fusion
        3.4 The exponential family *
            3.4.1 Definition
            3.4.2 Example
            3.4.3 Log partition function is cumulant generating function
            3.4.4 Maximum entropy derivation of the exponential family
        3.5 Mixture models
            3.5.1 Gaussian mixture models
            3.5.2 Bernoulli mixture models
        3.6 Probabilistic graphical models *
            3.6.1 Representation
            3.6.2 Inference
            3.6.3 Learning
        3.7 Exercises
    4 Statistics
        4.1 Introduction
        4.2 Maximum likelihood estimation (MLE)
            4.2.1 Definition
            4.2.2 Justification for MLE
            4.2.3 Example: MLE for the Bernoulli distribution
            4.2.4 Example: MLE for the categorical distribution
            4.2.5 Example: MLE for the univariate Gaussian
            4.2.6 Example: MLE for the multivariate Gaussian
            4.2.7 Example: MLE for linear regression
        4.3 Empirical risk minimization (ERM)
            4.3.1 Example: minimizing the misclassification rate
            4.3.2 Surrogate loss
        4.4 Other estimation methods *
            4.4.1 The method of moments
            4.4.2 Online (recursive) estimation
        4.5 Regularization
            4.5.1 Example: MAP estimation for the Bernoulli distribution
            4.5.2 Example: MAP estimation for the multivariate Gaussian *
            4.5.3 Example: weight decay
            4.5.4 Picking the regularizer using a validation set
            4.5.5 Cross-validation
            4.5.6 Early stopping
            4.5.7 Using more data
        4.6 Bayesian statistics *
            4.6.1 Conjugate priors
            4.6.2 The beta-binomial model
            4.6.3 The Dirichlet-multinomial model
            4.6.4 The Gaussian-Gaussian model
            4.6.5 Beyond conjugate priors
            4.6.6 Credible intervals
            4.6.7 Bayesian machine learning
            4.6.8 Computational issues
        4.7 Frequentist statistics *
            4.7.1 Sampling distributions
            4.7.2 Gaussian approximation of the sampling distribution of the MLE
            4.7.3 Bootstrap approximation of the sampling distribution of any estimator
            4.7.4 Confidence intervals
            4.7.5 Caution: Confidence intervals are not credible
            4.7.6 The bias-variance tradeoff
        4.8 Exercises
    5 Decision Theory
        5.1 Bayesian decision theory
            5.1.1 Basics
            5.1.2 Classification problems
            5.1.3 ROC curves
            5.1.4 Precision-recall curves
            5.1.5 Regression problems
            5.1.6 Probabilistic prediction problems
        5.2 Bayesian hypothesis testing
            5.2.1 Example: Testing if a coin is fair
            5.2.2 Bayesian model selection
            5.2.3 Occam’s razor
            5.2.4 Connection between cross validation and marginal likelihood
            5.2.5 Information criteria
        5.3 Frequentist decision theory
            5.3.1 Computing the risk of an estimator
            5.3.2 Consistent estimators
            5.3.3 Admissible estimators
        5.4 Empirical risk minimization
            5.4.1 Empirical risk
            5.4.2 Structural risk
            5.4.3 Cross-validation
            5.4.4 Statistical learning theory *
        5.5 Frequentist hypothesis testing *
            5.5.1 Likelihood ratio test
            5.5.2 Null hypothesis significance testing (NHST)
            5.5.3 p-values
            5.5.4 p-values considered harmful
            5.5.5 Why isn’t everyone a Bayesian?
        5.6 Exercises
    6 Information Theory
        6.1 Entropy
            6.1.1 Entropy for discrete random variables
            6.1.2 Cross entropy
            6.1.3 Joint entropy
            6.1.4 Conditional entropy
            6.1.5 Perplexity
            6.1.6 Differential entropy for continuous random variables *
        6.2 Relative entropy (KL divergence) *
            6.2.1 Definition
            6.2.2 Interpretation
            6.2.3 Example: KL divergence between two Gaussians
            6.2.4 Non-negativity of KL
            6.2.5 KL divergence and MLE
            6.2.6 Forward vs reverse KL
        6.3 Mutual information *
            6.3.1 Definition
            6.3.2 Interpretation
            6.3.3 Example
            6.3.4 Conditional mutual information
            6.3.5 MI as a “generalized correlation coefficient”
            6.3.6 Normalized mutual information
            6.3.7 Maximal information coefficient
            6.3.8 Data processing inequality
            6.3.9 Sufficient Statistics
            6.3.10 Fano’s inequality *
        6.4 Exercises
    7 Linear Algebra
        7.1 Introduction
            7.1.1 Notation
            7.1.2 Vector spaces
            7.1.3 Norms of a vector and matrix
            7.1.4 Properties of a matrix
            7.1.5 Special types of matrices
        7.2 Matrix multiplication
            7.2.1 Vector–vector products
            7.2.2 Matrix–vector products
            7.2.3 Matrix–matrix products
            7.2.4 Application: manipulating data matrices
            7.2.5 Kronecker products *
            7.2.6 Einstein summation *
        7.3 Matrix inversion
            7.3.1 The inverse of a square matrix
            7.3.2 Schur complements *
            7.3.3 The matrix inversion lemma *
            7.3.4 Matrix determinant lemma *
            7.3.5 Application: deriving the conditionals of an MVN *
        7.4 Eigenvalue decomposition (EVD)
            7.4.1 Basics
            7.4.2 Diagonalization
            7.4.3 Eigenvalues and eigenvectors of symmetric matrices
            7.4.4 Geometry of quadratic forms
            7.4.5 Standardizing and whitening data
            7.4.6 Power method
            7.4.7 Deflation
            7.4.8 Eigenvectors optimize quadratic forms
        7.5 Singular value decomposition (SVD)
            7.5.1 Basics
            7.5.2 Connection between SVD and EVD
            7.5.3 Pseudo inverse
            7.5.4 SVD and the range and null space of a matrix *
            7.5.5 Truncated SVD
        7.6 Other matrix decompositions *
            7.6.1 LU factorization
            7.6.2 QR decomposition
            7.6.3 Cholesky decomposition
        7.7 Solving systems of linear equations *
            7.7.1 Solving square systems
            7.7.2 Solving underconstrained systems (least norm estimation)
            7.7.3 Solving overconstrained systems (least squares estimation)
        7.8 Matrix calculus
            7.8.1 Derivatives
            7.8.2 Gradients
            7.8.3 Directional derivative
            7.8.4 Total derivative *
            7.8.5 Jacobian
            7.8.6 Hessian
            7.8.7 Gradients of commonly used functions
        7.9 Exercises
    8 Optimization
        8.1 Introduction
            8.1.1 Local vs global optimization
            8.1.2 Constrained vs unconstrained optimization
            8.1.3 Convex vs nonconvex optimization
            8.1.4 Smooth vs nonsmooth optimization
        8.2 First-order methods
            8.2.1 Descent direction
            8.2.2 Step size (learning rate)
            8.2.3 Convergence rates
            8.2.4 Momentum methods
        8.3 Second-order methods
            8.3.1 Newton’s method
            8.3.2 BFGS and other quasi-Newton methods
            8.3.3 Trust region methods
        8.4 Stochastic gradient descent
            8.4.1 Application to finite sum problems
            8.4.2 Example: SGD for fitting linear regression
            8.4.3 Choosing the step size (learning rate)
            8.4.4 Iterate averaging
            8.4.5 Variance reduction *
            8.4.6 Preconditioned SGD
        8.5 Constrained optimization
            8.5.1 Lagrange multipliers
            8.5.2 The KKT conditions
            8.5.3 Linear programming
            8.5.4 Quadratic programming
            8.5.5 Mixed integer linear programming *
        8.6 Proximal gradient method *
            8.6.1 Projected gradient descent
            8.6.2 Proximal operator for ℓ1-norm regularizer
            8.6.3 Proximal operator for quantization
            8.6.4 Incremental (online) proximal methods
        8.7 Bound optimization *
            8.7.1 The general algorithm
            8.7.2 The EM algorithm
            8.7.3 Example: EM for a GMM
        8.8 Blackbox and derivative free optimization
        8.9 Exercises
II Linear Models
    9 Linear Discriminant Analysis
        9.1 Introduction
        9.2 Gaussian discriminant analysis
            9.2.1 Quadratic decision boundaries
            9.2.2 Linear decision boundaries
            9.2.3 The connection between LDA and logistic regression
            9.2.4 Model fitting
            9.2.5 Nearest centroid classifier
            9.2.6 Fisher’s linear discriminant analysis *
        9.3 Naive Bayes classifiers
            9.3.1 Example models
            9.3.2 Model fitting
            9.3.3 Bayesian naive Bayes
            9.3.4 The connection between naive Bayes and logistic regression
        9.4 Generative vs discriminative classifiers
            9.4.1 Advantages of discriminative classifiers
            9.4.2 Advantages of generative classifiers
            9.4.3 Handling missing features
        9.5 Exercises
    10 Logistic Regression
        10.1 Introduction
        10.2 Binary logistic regression
            10.2.1 Linear classifiers
            10.2.2 Nonlinear classifiers
            10.2.3 Maximum likelihood estimation
            10.2.4 Stochastic gradient descent
            10.2.5 Perceptron algorithm
            10.2.6 Iteratively reweighted least squares
            10.2.7 MAP estimation
            10.2.8 Standardization
        10.3 Multinomial logistic regression
            10.3.1 Linear and nonlinear classifiers
            10.3.2 Maximum likelihood estimation
            10.3.3 Gradient-based optimization
            10.3.4 Bound optimization
            10.3.5 MAP estimation
            10.3.6 Maximum entropy classifiers
            10.3.7 Hierarchical classification
            10.3.8 Handling large numbers of classes
        10.4 Robust logistic regression *
            10.4.1 Mixture model for the likelihood
            10.4.2 Bi-tempered loss
        10.5 Bayesian logistic regression *
            10.5.1 Laplace approximation
            10.5.2 Approximating the posterior predictive
        10.6 Exercises
    11 Linear Regression
        11.1 Introduction
        11.2 Least squares linear regression
            11.2.1 Terminology
            11.2.2 Least squares estimation
            11.2.3 Other approaches to computing the MLE
            11.2.4 Measuring goodness of fit
        11.3 Ridge regression
            11.3.1 Computing the MAP estimate
            11.3.2 Connection between ridge regression and PCA
            11.3.3 Choosing the strength of the regularizer
        11.4 Lasso regression
            11.4.1 MAP estimation with a Laplace prior (ℓ1 regularization)
            11.4.2 Why does ℓ1 regularization yield sparse solutions?
            11.4.3 Hard vs soft thresholding
            11.4.4 Regularization path
            11.4.5 Comparison of least squares, lasso, ridge and subset selection
            11.4.6 Variable selection consistency
            11.4.7 Group lasso
            11.4.8 Elastic net (ridge and lasso combined)
            11.4.9 Optimization algorithms
        11.5 Regression splines *
            11.5.1 B-spline basis functions
            11.5.2 Fitting a linear model using a spline basis
            11.5.3 Smoothing splines
            11.5.4 Generalized additive models
        11.6 Robust linear regression *
            11.6.1 Laplace likelihood
            11.6.2 Student-t likelihood
            11.6.3 Huber loss
            11.6.4 RANSAC
        11.7 Bayesian linear regression *
            11.7.1 Priors
            11.7.2 Posteriors
            11.7.3 Example
            11.7.4 Computing the posterior predictive
            11.7.5 The advantage of centering
            11.7.6 Dealing with multicollinearity
            11.7.7 Automatic relevancy determination (ARD) *
        11.8 Exercises
    12 Generalized Linear Models *
        12.1 Introduction
        12.2 Examples
            12.2.1 Linear regression
            12.2.2 Binomial regression
            12.2.3 Poisson regression
        12.3 GLMs with non-canonical link functions
        12.4 Maximum likelihood estimation
        12.5 Worked example: predicting insurance claims
III Deep Neural Networks
    13 Neural Networks for Structured Data
        13.1 Introduction
        13.2 Multilayer perceptrons (MLPs)
            13.2.1 The XOR problem
            13.2.2 Differentiable MLPs
            13.2.3 Activation functions
            13.2.4 Example models
            13.2.5 The importance of depth
            13.2.6 The “deep learning revolution”
            13.2.7 Connections with biology
        13.3 Backpropagation
            13.3.1 Forward vs reverse mode differentiation
            13.3.2 Reverse mode differentiation for multilayer perceptrons
            13.3.3 Vector-Jacobian product for common layers
            13.3.4 Computation graphs
        13.4 Training neural networks
            13.4.1 Tuning the learning rate
            13.4.2 Vanishing and exploding gradients
            13.4.3 Non-saturating activation functions
            13.4.4 Residual connections
            13.4.5 Parameter initialization
            13.4.6 Parallel training
        13.5 Regularization
            13.5.1 Early stopping
            13.5.2 Weight decay
            13.5.3 Sparse DNNs
            13.5.4 Dropout
            13.5.5 Bayesian neural networks
            13.5.6 Regularization effects of (stochastic) gradient descent *
        13.6 Other kinds of feedforward networks *
            13.6.1 Radial basis function networks
            13.6.2 Mixtures of experts
        13.7 Exercises
    14 Neural Networks for Images
        14.1 Introduction
        14.2 Common layers
            14.2.1 Convolutional layers
            14.2.2 Pooling layers
            14.2.3 Putting it all together
            14.2.4 Normalization layers
        14.3 Common architectures for image classification
            14.3.1 LeNet
            14.3.2 AlexNet
            14.3.3 GoogLeNet (Inception)
            14.3.4 ResNet
            14.3.5 DenseNet
            14.3.6 Neural architecture search
        14.4 Other forms of convolution *
            14.4.1 Dilated convolution
            14.4.2 Transposed convolution
            14.4.3 Depthwise separable convolution
        14.5 Solving other discriminative vision tasks with CNNs *
            14.5.1 Image tagging
            14.5.2 Object detection
            14.5.3 Instance segmentation
            14.5.4 Semantic segmentation
            14.5.5 Human pose estimation
        14.6 Generating images by inverting CNNs *
            14.6.1 Converting a trained classifier into a generative model
            14.6.2 Image priors
            14.6.3 Visualizing the features learned by a CNN
            14.6.4 Deep Dream
            14.6.5 Neural style transfer
    15 Neural Networks for Sequences
        15.1 Introduction
        15.2 Recurrent neural networks (RNNs)
            15.2.1 Vec2Seq (sequence generation)
            15.2.2 Seq2Vec (sequence classification)
            15.2.3 Seq2Seq (sequence translation)
            15.2.4 Teacher forcing
            15.2.5 Backpropagation through time
            15.2.6 Vanishing and exploding gradients
            15.2.7 Gating and long term memory
            15.2.8 Beam search
        15.3 1d CNNs
            15.3.1 1d CNNs for sequence classification
            15.3.2 Causal 1d CNNs for sequence generation
        15.4 Attention
            15.4.1 Attention as soft dictionary lookup
            15.4.2 Kernel regression as non-parametric attention
            15.4.3 Parametric attention
            15.4.4 Seq2Seq with attention
            15.4.5 Seq2vec with attention (text classification)
            15.4.6 Seq+Seq2Vec with attention (text pair classification)
            15.4.7 Soft vs hard attention
        15.5 Transformers
            15.5.1 Self-attention
            15.5.2 Multi-headed attention
            15.5.3 Positional encoding
            15.5.4 Putting it all together
            15.5.5 Comparing transformers, CNNs and RNNs
            15.5.6 Transformers for images *
            15.5.7 Other transformer variants *
        15.6 Efficient transformers *
            15.6.1 Fixed non-learnable localized attention patterns
            15.6.2 Learnable sparse attention patterns
            15.6.3 Memory and recurrence methods
            15.6.4 Low-rank and kernel methods
        15.7 Language models and unsupervised representation learning
            15.7.1 ELMo
            15.7.2 BERT
            15.7.3 GPT
            15.7.4 T5
            15.7.5 Discussion
IV Nonparametric Models
    16 Exemplar-based Methods
        16.1 K nearest neighbor (KNN) classification
            16.1.1 Example
            16.1.2 The curse of dimensionality
            16.1.3 Reducing the speed and memory requirements
            16.1.4 Open set recognition
        16.2 Learning distance metrics
            16.2.1 Linear and convex methods
            16.2.2 Deep metric learning
            16.2.3 Classification losses
            16.2.4 Ranking losses
            16.2.5 Speeding up ranking loss optimization
            16.2.6 Other training tricks for DML
        16.3 Kernel density estimation (KDE)
            16.3.1 Density kernels
            16.3.2 Parzen window density estimator
            16.3.3 How to choose the bandwidth parameter
            16.3.4 From KDE to KNN classification
            16.3.5 Kernel regression
    17 Kernel Methods *
        17.1 Mercer kernels
            17.1.1 Mercer’s theorem
            17.1.2 Some popular Mercer kernels
        17.2 Gaussian processes
            17.2.1 Noise-free observations
            17.2.2 Noisy observations
            17.2.3 Comparison to kernel regression
            17.2.4 Weight space vs function space
            17.2.5 Numerical issues
            17.2.6 Estimating the kernel
            17.2.7 GPs for classification
            17.2.8 Connections with deep learning
            17.2.9 Scaling GPs to large datasets
        17.3 Support vector machines (SVMs)
            17.3.1 Large margin classifiers
            17.3.2 The dual problem
            17.3.3 Soft margin classifiers
            17.3.4 The kernel trick
            17.3.5 Converting SVM outputs into probabilities
            17.3.6 Connection with logistic regression
            17.3.7 Multi-class classification with SVMs
            17.3.8 How to choose the regularizer C
            17.3.9 Kernel ridge regression
            17.3.10 SVMs for regression
        17.4 Sparse vector machines
            17.4.1 Relevance vector machines (RVMs)
            17.4.2 Comparison of sparse and dense kernel methods
        17.5 Exercises
    18 Trees, Forests, Bagging, and Boosting
        18.1 Classification and regression trees (CART)
            18.1.1 Model definition
            18.1.2 Model fitting
            18.1.3 Regularization
            18.1.4 Handling missing input features
            18.1.5 Pros and cons
        18.2 Ensemble learning
            18.2.1 Stacking
            18.2.2 Ensembling is not Bayes model averaging
        18.3 Bagging
        18.4 Random forests
        18.5 Boosting
            18.5.1 Forward stagewise additive modeling
            18.5.2 Quadratic loss and least squares boosting
            18.5.3 Exponential loss and AdaBoost
            18.5.4 LogitBoost
            18.5.5 Gradient boosting
        18.6 Interpreting tree ensembles
            18.6.1 Feature importance
            18.6.2 Partial dependency plots
V Beyond Supervised Learning
    19 Learning with Fewer Labeled Examples
        19.1 Data augmentation
            19.1.1 Examples
            19.1.2 Theoretical justification
        19.2 Transfer learning
            19.2.1 Fine-tuning
            19.2.2 Adapters
            19.2.3 Supervised pre-training
            19.2.4 Unsupervised pre-training (self-supervised learning)
            19.2.5 Domain adaptation
        19.3 Semi-supervised learning
            19.3.1 Self-training and pseudo-labeling
            19.3.2 Entropy minimization
            19.3.3 Co-training
            19.3.4 Label propagation on graphs
            19.3.5 Consistency regularization
            19.3.6 Deep generative models *
            19.3.7 Combining self-supervised and semi-supervised learning
        19.4 Active learning
            19.4.1 Decision-theoretic approach
            19.4.2 Information-theoretic approach
            19.4.3 Batch active learning
        19.5 Meta-learning
            19.5.1 Model-agnostic meta-learning (MAML)
        19.6 Few-shot learning
            19.6.1 Matching networks
        19.7 Weakly supervised learning
        19.8 Exercises
    20 Dimensionality Reduction
        20.1 Principal components analysis (PCA)
            20.1.1 Examples
            20.1.2 Derivation of the algorithm
            20.1.3 Computational issues
            20.1.4 Choosing the number of latent dimensions
        20.2 Factor analysis *
            20.2.1 Generative model
            20.2.2 Probabilistic PCA
            20.2.3 EM algorithm for FA/PPCA
            20.2.4 Unidentifiability of the parameters
            20.2.5 Nonlinear factor analysis
            20.2.6 Mixtures of factor analysers
            20.2.7 Exponential family factor analysis
            20.2.8 Factor analysis models for paired data
        20.3 Autoencoders
            20.3.1 Bottleneck autoencoders
            20.3.2 Denoising autoencoders
            20.3.3 Contractive autoencoders
            20.3.4 Sparse autoencoders
            20.3.5 Variational autoencoders
        20.4 Manifold learning *
            20.4.1 What are manifolds?
            20.4.2 The manifold hypothesis
            20.4.3 Approaches to manifold learning
            20.4.4 Multi-dimensional scaling (MDS)
            20.4.5 Isomap
            20.4.6 Kernel PCA
            20.4.7 Maximum variance unfolding (MVU)
            20.4.8 Local linear embedding (LLE)
            20.4.9 Laplacian eigenmaps
            20.4.10 t-SNE
        20.5 Word embeddings
            20.5.1 Latent semantic analysis / indexing
            20.5.2 Word2vec
            20.5.3 GloVE
            20.5.4 Word analogies
            20.5.5 RAND-WALK model of word embeddings
            20.5.6 Contextual word embeddings
        20.6 Exercises
    21 Clustering
        21.1 Introduction
            21.1.1 Evaluating the output of clustering methods
        21.2 Hierarchical agglomerative clustering
            21.2.1 The algorithm
            21.2.2 Example
            21.2.3 Extensions
        21.3 K means clustering
            21.3.1 The algorithm
            21.3.2 Examples
            21.3.3 Vector quantization
            21.3.4 The K-means++ algorithm
            21.3.5 The K-medoids algorithm
            21.3.6 Speedup tricks
            21.3.7 Choosing the number of clusters K
        21.4 Clustering using mixture models
            21.4.1 Mixtures of Gaussians
            21.4.2 Mixtures of Bernoullis
        21.5 Spectral clustering *
            21.5.1 Normalized cuts
            21.5.2 Eigenvectors of the graph Laplacian encode the clustering
            21.5.3 Example
            21.5.4 Connection with other methods
        21.6 Biclustering *
            21.6.1 Basic biclustering
            21.6.2 Nested partition models (Crosscat)
    22 Recommender Systems
        22.1 Explicit feedback
            22.1.1 Datasets
            22.1.2 Collaborative filtering
            22.1.3 Matrix factorization
            22.1.4 Autoencoders
        22.2 Implicit feedback
            22.2.1 Bayesian personalized ranking
            22.2.2 Factorization machines
            22.2.3 Neural matrix factorization
        22.3 Leveraging side information
        22.4 Exploration-exploitation tradeoff
    23 Graph Embeddings *
        23.1 Introduction
        23.2 Graph Embedding as an Encoder/Decoder Problem
        23.3 Shallow graph embeddings
            23.3.1 Unsupervised embeddings
            23.3.2 Distance-based: Euclidean methods
            23.3.3 Distance-based: non-Euclidean methods
            23.3.4 Outer product-based: Matrix factorization methods
            23.3.5 Outer product-based: Skip-gram methods
            23.3.6 Supervised embeddings
        23.4 Graph Neural Networks
            23.4.1 Message passing GNNs
            23.4.2 Spectral Graph Convolutions
            23.4.3 Spatial Graph Convolutions
            23.4.4 Non-Euclidean Graph Convolutions
        23.5 Deep graph embeddings
            23.5.1 Unsupervised embeddings
            23.5.2 Semi-supervised embeddings
        23.6 Applications
            23.6.1 Unsupervised applications
            23.6.2 Supervised applications
A Notation
    A.1 Introduction
    A.2 Common mathematical symbols
    A.3 Functions
        A.3.1 Common functions of one argument
        A.3.2 Common functions of two arguments
        A.3.3 Common functions of > 2 arguments
    A.4 Linear algebra
        A.4.1 General notation
        A.4.2 Vectors
        A.4.3 Matrices
        A.4.4 Matrix calculus
    A.5 Optimization
    A.6 Probability
    A.7 Information theory
    A.8 Statistics and machine learning
        A.8.1 Supervised learning
        A.8.2 Unsupervised learning and generative models
        A.8.3 Bayesian inference
    A.9 Abbreviations
Index
Bibliography