A detailed and up-to-date introduction to machine learning, presented through the unifying lens of probabilistic modeling and Bayesian decision theory.
This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic modeling and Bayesian decision theory. The book covers mathematical background (including linear algebra and optimization), basic supervised learning (including linear and logistic regression and deep neural networks), as well as more advanced topics (including transfer learning and unsupervised learning). End-of-chapter exercises allow students to apply what they have learned, and an appendix covers notation.
Probabilistic Machine Learning grew out of the author’s 2012 book, Machine Learning: A Probabilistic Perspective. More than just a simple update, this is a completely new book that reflects the dramatic developments in the field since 2012, most notably deep learning. In addition, the new book is accompanied by online Python code, using libraries such as scikit-learn, JAX, PyTorch, and Tensorflow, which can be used to reproduce nearly all the figures; this code can be run inside a web browser using cloud-based notebooks, and provides a practical complement to the theoretical topics discussed in the book. This introductory text will be followed by a sequel that covers more advanced topics, taking the same probabilistic approach.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Probabilistic Machine Learning for free.
- Title
- Probabilistic Machine Learning
- Subtitle
- An Introduction
- Publisher
- The MIT Press
- Author(s)
- Kevin P. Murphy
- Published
- 2022-02-01
- Edition
- 1
- Format
- eBook (pdf, epub, mobi)
- Pages
- 944
- Language
- English
- ISBN-10
- 0262046822
- ISBN-13
- 9780262046824
- License
- CC BY-NC-SA
- Book Homepage
- Free eBook, Errata, Code, Solutions, etc.
Copyright Preface 1 Introduction 1.1 What is machine learning? 1.2 Supervised learning 1.2.1 Classification 1.2.2 Regression 1.2.3 Overfitting and generalization 1.2.4 No free lunch theorem 1.3 Unsupervised learning 1.3.1 Clustering 1.3.2 Discovering latent “factors of variation” 1.3.3 Self-supervised learning 1.3.4 Evaluating unsupervised learning 1.4 Reinforcement learning 1.5 Data 1.5.1 Some common image datasets 1.5.2 Some common text datasets 1.5.3 Preprocessing discrete input data 1.5.4 Preprocessing text data 1.5.5 Handling missing data 1.6 Discussion 1.6.1 The relationship between ML and other fields 1.6.2 Structure of the book 1.6.3 Caveats I Foundations 2 Probability: Univariate Models 2.1 Introduction 2.1.1 What is probability? 2.1.2 Types of uncertainty 2.1.3 Probability as an extension of logic 2.2 Random variables 2.2.1 Discrete random variables 2.2.2 Continuous random variables 2.2.3 Sets of related random variables 2.2.4 Independence and conditional independence 2.2.5 Moments of a distribution 2.2.6 Limitations of summary statistics * 2.3 Bayes’ rule 2.3.1 Example: Testing for COVID-19 2.3.2 Example: The Monty Hall problem 2.3.3 Inverse problems * 2.4 Bernoulli and binomial distributions 2.4.1 Definition 2.4.2 Sigmoid (logistic) function 2.4.3 Binary logistic regression 2.5 Categorical and multinomial distributions 2.5.1 Definition 2.5.2 Softmax function 2.5.3 Multiclass logistic regression 2.5.4 Log-sum-exp trick 2.6 Univariate Gaussian (normal) distribution 2.6.1 Cumulative distribution function 2.6.2 Probability density function 2.6.3 Regression 2.6.4 Why is the Gaussian distribution so widely used? 2.6.5 Dirac delta function as a limiting case 2.7 Some other common univariate distributions * 2.7.1 Student t distribution 2.7.2 Cauchy distribution 2.7.3 Laplace distribution 2.7.4 Beta distribution 2.7.5 Gamma distribution 2.7.6 Empirical distribution 2.8 Transformations of random variables * 2.8.1 Discrete case 2.8.2 Continuous case 2.8.3 Invertible transformations (bijections) 2.8.4 Moments of a linear transformation 2.8.5 The convolution theorem 2.8.6 Central limit theorem 2.8.7 Monte Carlo approximation 2.9 Exercises 3 Probability: Multivariate Models 3.1 Joint distributions for multiple random variables 3.1.1 Covariance 3.1.2 Correlation 3.1.3 Uncorrelated does not imply independent 3.1.4 Correlation does not imply causation 3.1.5 Simpson’s paradox 3.2 The multivariate Gaussian (normal) distribution 3.2.1 Definition 3.2.2 Mahalanobis distance 3.2.3 Marginals and conditionals of an MVN * 3.2.4 Example: conditioning a 2d Gaussian 3.2.5 Example: Imputing missing values * 3.3 Linear Gaussian systems * 3.3.1 Bayes rule for Gaussians 3.3.2 Derivation * 3.3.3 Example: Inferring an unknown scalar 3.3.4 Example: inferring an unknown vector 3.3.5 Example: sensor fusion 3.4 The exponential family * 3.4.1 Definition 3.4.2 Example 3.4.3 Log partition function is cumulant generating function 3.4.4 Maximum entropy derivation of the exponential family 3.5 Mixture models 3.5.1 Gaussian mixture models 3.5.2 Bernoulli mixture models 3.6 Probabilistic graphical models * 3.6.1 Representation 3.6.2 Inference 3.6.3 Learning 3.7 Exercises 4 Statistics 4.1 Introduction 4.2 Maximum likelihood estimation (MLE) 4.2.1 Definition 4.2.2 Justification for MLE 4.2.3 Example: MLE for the Bernoulli distribution 4.2.4 Example: MLE for the categorical distribution 4.2.5 Example: MLE for the univariate Gaussian 4.2.6 Example: MLE for the multivariate Gaussian 4.2.7 Example: MLE for linear regression 4.3 Empirical risk minimization (ERM) 4.3.1 Example: minimizing the misclassification rate 4.3.2 Surrogate loss 4.4 Other estimation methods * 4.4.1 The method of moments 4.4.2 Online (recursive) estimation 4.5 Regularization 4.5.1 Example: MAP estimation for the Bernoulli distribution 4.5.2 Example: MAP estimation for the multivariate Gaussian * 4.5.3 Example: weight decay 4.5.4 Picking the regularizer using a validation set 4.5.5 Cross-validation 4.5.6 Early stopping 4.5.7 Using more data 4.6 Bayesian statistics * 4.6.1 Conjugate priors 4.6.2 The beta-binomial model 4.6.3 The Dirichlet-multinomial model 4.6.4 The Gaussian-Gaussian model 4.6.5 Beyond conjugate priors 4.6.6 Credible intervals 4.6.7 Bayesian machine learning 4.6.8 Computational issues 4.7 Frequentist statistics * 4.7.1 Sampling distributions 4.7.2 Gaussian approximation of the sampling distribution of the MLE 4.7.3 Bootstrap approximation of the sampling distribution of any estimator 4.7.4 Confidence intervals 4.7.5 Caution: Confidence intervals are not credible 4.7.6 The bias-variance tradeoff 4.8 Exercises 5 Decision Theory 5.1 Bayesian decision theory 5.1.1 Basics 5.1.2 Classification problems 5.1.3 ROC curves 5.1.4 Precision-recall curves 5.1.5 Regression problems 5.1.6 Probabilistic prediction problems 5.2 Bayesian hypothesis testing 5.2.1 Example: Testing if a coin is fair 5.2.2 Bayesian model selection 5.2.3 Occam’s razor 5.2.4 Connection between cross validation and marginal likelihood 5.2.5 Information criteria 5.3 Frequentist decision theory 5.3.1 Computing the risk of an estimator 5.3.2 Consistent estimators 5.3.3 Admissible estimators 5.4 Empirical risk minimization 5.4.1 Empirical risk 5.4.2 Structural risk 5.4.3 Cross-validation 5.4.4 Statistical learning theory * 5.5 Frequentist hypothesis testing * 5.5.1 Likelihood ratio test 5.5.2 Null hypothesis significance testing (NHST) 5.5.3 p-values 5.5.4 p-values considered harmful 5.5.5 Why isn’t everyone a Bayesian? 5.6 Exercises 6 Information Theory 6.1 Entropy 6.1.1 Entropy for discrete random variables 6.1.2 Cross entropy 6.1.3 Joint entropy 6.1.4 Conditional entropy 6.1.5 Perplexity 6.1.6 Differential entropy for continuous random variables * 6.2 Relative entropy (KL divergence) * 6.2.1 Definition 6.2.2 Interpretation 6.2.3 Example: KL divergence between two Gaussians 6.2.4 Non-negativity of KL 6.2.5 KL divergence and MLE 6.2.6 Forward vs reverse KL 6.3 Mutual information * 6.3.1 Definition 6.3.2 Interpretation 6.3.3 Example 6.3.4 Conditional mutual information 6.3.5 MI as a “generalized correlation coefficient” 6.3.6 Normalized mutual information 6.3.7 Maximal information coefficient 6.3.8 Data processing inequality 6.3.9 Sufficient Statistics 6.3.10 Fano’s inequality * 6.4 Exercises 7 Linear Algebra 7.1 Introduction 7.1.1 Notation 7.1.2 Vector spaces 7.1.3 Norms of a vector and matrix 7.1.4 Properties of a matrix 7.1.5 Special types of matrices 7.2 Matrix multiplication 7.2.1 Vector–vector products 7.2.2 Matrix–vector products 7.2.3 Matrix–matrix products 7.2.4 Application: manipulating data matrices 7.2.5 Kronecker products * 7.2.6 Einstein summation * 7.3 Matrix inversion 7.3.1 The inverse of a square matrix 7.3.2 Schur complements * 7.3.3 The matrix inversion lemma * 7.3.4 Matrix determinant lemma * 7.3.5 Application: deriving the conditionals of an MVN * 7.4 Eigenvalue decomposition (EVD) 7.4.1 Basics 7.4.2 Diagonalization 7.4.3 Eigenvalues and eigenvectors of symmetric matrices 7.4.4 Geometry of quadratic forms 7.4.5 Standardizing and whitening data 7.4.6 Power method 7.4.7 Deflation 7.4.8 Eigenvectors optimize quadratic forms 7.5 Singular value decomposition (SVD) 7.5.1 Basics 7.5.2 Connection between SVD and EVD 7.5.3 Pseudo inverse 7.5.4 SVD and the range and null space of a matrix * 7.5.5 Truncated SVD 7.6 Other matrix decompositions * 7.6.1 LU factorization 7.6.2 QR decomposition 7.6.3 Cholesky decomposition 7.7 Solving systems of linear equations * 7.7.1 Solving square systems 7.7.2 Solving underconstrained systems (least norm estimation) 7.7.3 Solving overconstrained systems (least squares estimation) 7.8 Matrix calculus 7.8.1 Derivatives 7.8.2 Gradients 7.8.3 Directional derivative 7.8.4 Total derivative * 7.8.5 Jacobian 7.8.6 Hessian 7.8.7 Gradients of commonly used functions 7.9 Exercises 8 Optimization 8.1 Introduction 8.1.1 Local vs global optimization 8.1.2 Constrained vs unconstrained optimization 8.1.3 Convex vs nonconvex optimization 8.1.4 Smooth vs nonsmooth optimization 8.2 First-order methods 8.2.1 Descent direction 8.2.2 Step size (learning rate) 8.2.3 Convergence rates 8.2.4 Momentum methods 8.3 Second-order methods 8.3.1 Newton’s method 8.3.2 BFGS and other quasi-Newton methods 8.3.3 Trust region methods 8.4 Stochastic gradient descent 8.4.1 Application to finite sum problems 8.4.2 Example: SGD for fitting linear regression 8.4.3 Choosing the step size (learning rate) 8.4.4 Iterate averaging 8.4.5 Variance reduction * 8.4.6 Preconditioned SGD 8.5 Constrained optimization 8.5.1 Lagrange multipliers 8.5.2 The KKT conditions 8.5.3 Linear programming 8.5.4 Quadratic programming 8.5.5 Mixed integer linear programming * 8.6 Proximal gradient method * 8.6.1 Projected gradient descent 8.6.2 Proximal operator for ℓ1-norm regularizer 8.6.3 Proximal operator for quantization 8.6.4 Incremental (online) proximal methods 8.7 Bound optimization * 8.7.1 The general algorithm 8.7.2 The EM algorithm 8.7.3 Example: EM for a GMM 8.8 Blackbox and derivative free optimization 8.9 Exercises II Linear Models 9 Linear Discriminant Analysis 9.1 Introduction 9.2 Gaussian discriminant analysis 9.2.1 Quadratic decision boundaries 9.2.2 Linear decision boundaries 9.2.3 The connection between LDA and logistic regression 9.2.4 Model fitting 9.2.5 Nearest centroid classifier 9.2.6 Fisher’s linear discriminant analysis * 9.3 Naive Bayes classifiers 9.3.1 Example models 9.3.2 Model fitting 9.3.3 Bayesian naive Bayes 9.3.4 The connection between naive Bayes and logistic regression 9.4 Generative vs discriminative classifiers 9.4.1 Advantages of discriminative classifiers 9.4.2 Advantages of generative classifiers 9.4.3 Handling missing features 9.5 Exercises 10 Logistic Regression 10.1 Introduction 10.2 Binary logistic regression 10.2.1 Linear classifiers 10.2.2 Nonlinear classifiers 10.2.3 Maximum likelihood estimation 10.2.4 Stochastic gradient descent 10.2.5 Perceptron algorithm 10.2.6 Iteratively reweighted least squares 10.2.7 MAP estimation 10.2.8 Standardization 10.3 Multinomial logistic regression 10.3.1 Linear and nonlinear classifiers 10.3.2 Maximum likelihood estimation 10.3.3 Gradient-based optimization 10.3.4 Bound optimization 10.3.5 MAP estimation 10.3.6 Maximum entropy classifiers 10.3.7 Hierarchical classification 10.3.8 Handling large numbers of classes 10.4 Robust logistic regression * 10.4.1 Mixture model for the likelihood 10.4.2 Bi-tempered loss 10.5 Bayesian logistic regression * 10.5.1 Laplace approximation 10.5.2 Approximating the posterior predictive 10.6 Exercises 11 Linear Regression 11.1 Introduction 11.2 Least squares linear regression 11.2.1 Terminology 11.2.2 Least squares estimation 11.2.3 Other approaches to computing the MLE 11.2.4 Measuring goodness of fit 11.3 Ridge regression 11.3.1 Computing the MAP estimate 11.3.2 Connection between ridge regression and PCA 11.3.3 Choosing the strength of the regularizer 11.4 Lasso regression 11.4.1 MAP estimation with a Laplace prior (ℓ1 regularization) 11.4.2 Why does ℓ1 regularization yield sparse solutions? 11.4.3 Hard vs soft thresholding 11.4.4 Regularization path 11.4.5 Comparison of least squares, lasso, ridge and subset selection 11.4.6 Variable selection consistency 11.4.7 Group lasso 11.4.8 Elastic net (ridge and lasso combined) 11.4.9 Optimization algorithms 11.5 Regression splines * 11.5.1 B-spline basis functions 11.5.2 Fitting a linear model using a spline basis 11.5.3 Smoothing splines 11.5.4 Generalized additive models 11.6 Robust linear regression * 11.6.1 Laplace likelihood 11.6.2 Student-t likelihood 11.6.3 Huber loss 11.6.4 RANSAC 11.7 Bayesian linear regression * 11.7.1 Priors 11.7.2 Posteriors 11.7.3 Example 11.7.4 Computing the posterior predictive 11.7.5 The advantage of centering 11.7.6 Dealing with multicollinearity 11.7.7 Automatic relevancy determination (ARD) * 11.8 Exercises 12 Generalized Linear Models * 12.1 Introduction 12.2 Examples 12.2.1 Linear regression 12.2.2 Binomial regression 12.2.3 Poisson regression 12.3 GLMs with non-canonical link functions 12.4 Maximum likelihood estimation 12.5 Worked example: predicting insurance claims III Deep Neural Networks 13 Neural Networks for Structured Data 13.1 Introduction 13.2 Multilayer perceptrons (MLPs) 13.2.1 The XOR problem 13.2.2 Differentiable MLPs 13.2.3 Activation functions 13.2.4 Example models 13.2.5 The importance of depth 13.2.6 The “deep learning revolution” 13.2.7 Connections with biology 13.3 Backpropagation 13.3.1 Forward vs reverse mode differentiation 13.3.2 Reverse mode differentiation for multilayer perceptrons 13.3.3 Vector-Jacobian product for common layers 13.3.4 Computation graphs 13.4 Training neural networks 13.4.1 Tuning the learning rate 13.4.2 Vanishing and exploding gradients 13.4.3 Non-saturating activation functions 13.4.4 Residual connections 13.4.5 Parameter initialization 13.4.6 Parallel training 13.5 Regularization 13.5.1 Early stopping 13.5.2 Weight decay 13.5.3 Sparse DNNs 13.5.4 Dropout 13.5.5 Bayesian neural networks 13.5.6 Regularization effects of (stochastic) gradient descent * 13.6 Other kinds of feedforward networks * 13.6.1 Radial basis function networks 13.6.2 Mixtures of experts 13.7 Exercises 14 Neural Networks for Images 14.1 Introduction 14.2 Common layers 14.2.1 Convolutional layers 14.2.2 Pooling layers 14.2.3 Putting it all together 14.2.4 Normalization layers 14.3 Common architectures for image classification 14.3.1 LeNet 14.3.2 AlexNet 14.3.3 GoogLeNet (Inception) 14.3.4 ResNet 14.3.5 DenseNet 14.3.6 Neural architecture search 14.4 Other forms of convolution * 14.4.1 Dilated convolution 14.4.2 Transposed convolution 14.4.3 Depthwise separable convolution 14.5 Solving other discriminative vision tasks with CNNs * 14.5.1 Image tagging 14.5.2 Object detection 14.5.3 Instance segmentation 14.5.4 Semantic segmentation 14.5.5 Human pose estimation 14.6 Generating images by inverting CNNs * 14.6.1 Converting a trained classifier into a generative model 14.6.2 Image priors 14.6.3 Visualizing the features learned by a CNN 14.6.4 Deep Dream 14.6.5 Neural style transfer 15 Neural Networks for Sequences 15.1 Introduction 15.2 Recurrent neural networks (RNNs) 15.2.1 Vec2Seq (sequence generation) 15.2.2 Seq2Vec (sequence classification) 15.2.3 Seq2Seq (sequence translation) 15.2.4 Teacher forcing 15.2.5 Backpropagation through time 15.2.6 Vanishing and exploding gradients 15.2.7 Gating and long term memory 15.2.8 Beam search 15.3 1d CNNs 15.3.1 1d CNNs for sequence classification 15.3.2 Causal 1d CNNs for sequence generation 15.4 Attention 15.4.1 Attention as soft dictionary lookup 15.4.2 Kernel regression as non-parametric attention 15.4.3 Parametric attention 15.4.4 Seq2Seq with attention 15.4.5 Seq2vec with attention (text classification) 15.4.6 Seq+Seq2Vec with attention (text pair classification) 15.4.7 Soft vs hard attention 15.5 Transformers 15.5.1 Self-attention 15.5.2 Multi-headed attention 15.5.3 Positional encoding 15.5.4 Putting it all together 15.5.5 Comparing transformers, CNNs and RNNs 15.5.6 Transformers for images * 15.5.7 Other transformer variants * 15.6 Efficient transformers * 15.6.1 Fixed non-learnable localized attention patterns 15.6.2 Learnable sparse attention patterns 15.6.3 Memory and recurrence methods 15.6.4 Low-rank and kernel methods 15.7 Language models and unsupervised representation learning 15.7.1 ELMo 15.7.2 BERT 15.7.3 GPT 15.7.4 T5 15.7.5 Discussion IV Nonparametric Models 16 Exemplar-based Methods 16.1 K nearest neighbor (KNN) classification 16.1.1 Example 16.1.2 The curse of dimensionality 16.1.3 Reducing the speed and memory requirements 16.1.4 Open set recognition 16.2 Learning distance metrics 16.2.1 Linear and convex methods 16.2.2 Deep metric learning 16.2.3 Classification losses 16.2.4 Ranking losses 16.2.5 Speeding up ranking loss optimization 16.2.6 Other training tricks for DML 16.3 Kernel density estimation (KDE) 16.3.1 Density kernels 16.3.2 Parzen window density estimator 16.3.3 How to choose the bandwidth parameter 16.3.4 From KDE to KNN classification 16.3.5 Kernel regression 17 Kernel Methods * 17.1 Mercer kernels 17.1.1 Mercer’s theorem 17.1.2 Some popular Mercer kernels 17.2 Gaussian processes 17.2.1 Noise-free observations 17.2.2 Noisy observations 17.2.3 Comparison to kernel regression 17.2.4 Weight space vs function space 17.2.5 Numerical issues 17.2.6 Estimating the kernel 17.2.7 GPs for classification 17.2.8 Connections with deep learning 17.2.9 Scaling GPs to large datasets 17.3 Support vector machines (SVMs) 17.3.1 Large margin classifiers 17.3.2 The dual problem 17.3.3 Soft margin classifiers 17.3.4 The kernel trick 17.3.5 Converting SVM outputs into probabilities 17.3.6 Connection with logistic regression 17.3.7 Multi-class classification with SVMs 17.3.8 How to choose the regularizer C 17.3.9 Kernel ridge regression 17.3.10 SVMs for regression 17.4 Sparse vector machines 17.4.1 Relevance vector machines (RVMs) 17.4.2 Comparison of sparse and dense kernel methods 17.5 Exercises 18 Trees, Forests, Bagging, and Boosting 18.1 Classification and regression trees (CART) 18.1.1 Model definition 18.1.2 Model fitting 18.1.3 Regularization 18.1.4 Handling missing input features 18.1.5 Pros and cons 18.2 Ensemble learning 18.2.1 Stacking 18.2.2 Ensembling is not Bayes model averaging 18.3 Bagging 18.4 Random forests 18.5 Boosting 18.5.1 Forward stagewise additive modeling 18.5.2 Quadratic loss and least squares boosting 18.5.3 Exponential loss and AdaBoost 18.5.4 LogitBoost 18.5.5 Gradient boosting 18.6 Interpreting tree ensembles 18.6.1 Feature importance 18.6.2 Partial dependency plots V Beyond Supervised Learning 19 Learning with Fewer Labeled Examples 19.1 Data augmentation 19.1.1 Examples 19.1.2 Theoretical justification 19.2 Transfer learning 19.2.1 Fine-tuning 19.2.2 Adapters 19.2.3 Supervised pre-training 19.2.4 Unsupervised pre-training (self-supervised learning) 19.2.5 Domain adaptation 19.3 Semi-supervised learning 19.3.1 Self-training and pseudo-labeling 19.3.2 Entropy minimization 19.3.3 Co-training 19.3.4 Label propagation on graphs 19.3.5 Consistency regularization 19.3.6 Deep generative models * 19.3.7 Combining self-supervised and semi-supervised learning 19.4 Active learning 19.4.1 Decision-theoretic approach 19.4.2 Information-theoretic approach 19.4.3 Batch active learning 19.5 Meta-learning 19.5.1 Model-agnostic meta-learning (MAML) 19.6 Few-shot learning 19.6.1 Matching networks 19.7 Weakly supervised learning 19.8 Exercises 20 Dimensionality Reduction 20.1 Principal components analysis (PCA) 20.1.1 Examples 20.1.2 Derivation of the algorithm 20.1.3 Computational issues 20.1.4 Choosing the number of latent dimensions 20.2 Factor analysis * 20.2.1 Generative model 20.2.2 Probabilistic PCA 20.2.3 EM algorithm for FA/PPCA 20.2.4 Unidentifiability of the parameters 20.2.5 Nonlinear factor analysis 20.2.6 Mixtures of factor analysers 20.2.7 Exponential family factor analysis 20.2.8 Factor analysis models for paired data 20.3 Autoencoders 20.3.1 Bottleneck autoencoders 20.3.2 Denoising autoencoders 20.3.3 Contractive autoencoders 20.3.4 Sparse autoencoders 20.3.5 Variational autoencoders 20.4 Manifold learning * 20.4.1 What are manifolds? 20.4.2 The manifold hypothesis 20.4.3 Approaches to manifold learning 20.4.4 Multi-dimensional scaling (MDS) 20.4.5 Isomap 20.4.6 Kernel PCA 20.4.7 Maximum variance unfolding (MVU) 20.4.8 Local linear embedding (LLE) 20.4.9 Laplacian eigenmaps 20.4.10 t-SNE 20.5 Word embeddings 20.5.1 Latent semantic analysis / indexing 20.5.2 Word2vec 20.5.3 GloVE 20.5.4 Word analogies 20.5.5 RAND-WALK model of word embeddings 20.5.6 Contextual word embeddings 20.6 Exercises 21 Clustering 21.1 Introduction 21.1.1 Evaluating the output of clustering methods 21.2 Hierarchical agglomerative clustering 21.2.1 The algorithm 21.2.2 Example 21.2.3 Extensions 21.3 K means clustering 21.3.1 The algorithm 21.3.2 Examples 21.3.3 Vector quantization 21.3.4 The K-means++ algorithm 21.3.5 The K-medoids algorithm 21.3.6 Speedup tricks 21.3.7 Choosing the number of clusters K 21.4 Clustering using mixture models 21.4.1 Mixtures of Gaussians 21.4.2 Mixtures of Bernoullis 21.5 Spectral clustering * 21.5.1 Normalized cuts 21.5.2 Eigenvectors of the graph Laplacian encode the clustering 21.5.3 Example 21.5.4 Connection with other methods 21.6 Biclustering * 21.6.1 Basic biclustering 21.6.2 Nested partition models (Crosscat) 22 Recommender Systems 22.1 Explicit feedback 22.1.1 Datasets 22.1.2 Collaborative filtering 22.1.3 Matrix factorization 22.1.4 Autoencoders 22.2 Implicit feedback 22.2.1 Bayesian personalized ranking 22.2.2 Factorization machines 22.2.3 Neural matrix factorization 22.3 Leveraging side information 22.4 Exploration-exploitation tradeoff 23 Graph Embeddings * 23.1 Introduction 23.2 Graph Embedding as an Encoder/Decoder Problem 23.3 Shallow graph embeddings 23.3.1 Unsupervised embeddings 23.3.2 Distance-based: Euclidean methods 23.3.3 Distance-based: non-Euclidean methods 23.3.4 Outer product-based: Matrix factorization methods 23.3.5 Outer product-based: Skip-gram methods 23.3.6 Supervised embeddings 23.4 Graph Neural Networks 23.4.1 Message passing GNNs 23.4.2 Spectral Graph Convolutions 23.4.3 Spatial Graph Convolutions 23.4.4 Non-Euclidean Graph Convolutions 23.5 Deep graph embeddings 23.5.1 Unsupervised embeddings 23.5.2 Semi-supervised embeddings 23.6 Applications 23.6.1 Unsupervised applications 23.6.2 Supervised applications A Notation A.1 Introduction A.2 Common mathematical symbols A.3 Functions A.3.1 Common functions of one argument A.3.2 Common functions of two arguments A.3.3 Common functions of > 2 arguments A.4 Linear algebra A.4.1 General notation A.4.2 Vectors A.4.3 Matrices A.4.4 Matrix calculus A.5 Optimization A.6 Probability A.7 Information theory A.8 Statistics and machine learning A.8.1 Supervised learning A.8.2 Unsupervised learning and generative models A.8.3 Bayesian inference A.9 Abbreviations Index Bibliography