This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates and first-year PhD students, as well as researchers and practitioners, and no previous knowledge of pattern recognition or machine learning concepts is assumed. It is the first machine learning textbook to include comprehensive coverage of recent developments such as probabilistic graphical models and deterministic inference methods, and to emphasize a modern Bayesian perspective. It is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. The hardcover edition has 738 pages in full colour and includes 431 exercises graded by difficulty (with solutions available on the book's homepage). Extensive support is provided for course instructors.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Pattern Recognition and Machine Learning for free.
- Title: Pattern Recognition and Machine Learning
- Publisher: Springer
- Author(s): Christopher M. Bishop
- Published: 2007-02-01
- Edition: 1
- Format: eBook (pdf, epub, mobi)
- Pages: 798
- Language: English
- ISBN-10: 0387310738
- ISBN-13: 9780387310732
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents

- Cover
- Preface
- Mathematical notation
- Contents
- 1. Introduction
  - 1.1 Example: Polynomial Curve Fitting
  - 1.2 Probability Theory
    - 1.2.1 Probability densities
    - 1.2.2 Expectations and covariances
    - 1.2.3 Bayesian probabilities
    - 1.2.4 The Gaussian distribution
    - 1.2.5 Curve fitting re-visited
    - 1.2.6 Bayesian curve fitting
  - 1.3 Model Selection
  - 1.4 The Curse of Dimensionality
  - 1.5 Decision Theory
    - 1.5.1 Minimizing the misclassification rate
    - 1.5.2 Minimizing the expected loss
    - 1.5.3 The reject option
    - 1.5.4 Inference and decision
    - 1.5.5 Loss functions for regression
  - 1.6 Information Theory
    - 1.6.1 Relative entropy and mutual information
  - Exercises
- 2. Probability Distributions
  - 2.1 Binary Variables
    - 2.1.1 The beta distribution
  - 2.2 Multinomial Variables
    - 2.2.1 The Dirichlet distribution
  - 2.3 The Gaussian Distribution
    - 2.3.1 Conditional Gaussian distributions
    - 2.3.2 Marginal Gaussian distributions
    - 2.3.3 Bayes’ theorem for Gaussian variables
    - 2.3.4 Maximum likelihood for the Gaussian
    - 2.3.5 Sequential estimation
    - 2.3.6 Bayesian inference for the Gaussian
    - 2.3.7 Student’s t-distribution
    - 2.3.8 Periodic variables
    - 2.3.9 Mixtures of Gaussians
  - 2.4 The Exponential Family
    - 2.4.1 Maximum likelihood and sufficient statistics
    - 2.4.2 Conjugate priors
    - 2.4.3 Noninformative priors
  - 2.5 Nonparametric Methods
    - 2.5.1 Kernel density estimators
    - 2.5.2 Nearest-neighbour methods
  - Exercises
- 3. Linear Models for Regression
  - 3.1 Linear Basis Function Models
    - 3.1.1 Maximum likelihood and least squares
    - 3.1.2 Geometry of least squares
    - 3.1.3 Sequential learning
    - 3.1.4 Regularized least squares
    - 3.1.5 Multiple outputs
  - 3.2 The Bias-Variance Decomposition
  - 3.3 Bayesian Linear Regression
    - 3.3.1 Parameter distribution
    - 3.3.2 Predictive distribution
    - 3.3.3 Equivalent kernel
  - 3.4 Bayesian Model Comparison
  - 3.5 The Evidence Approximation
    - 3.5.1 Evaluation of the evidence function
    - 3.5.2 Maximizing the evidence function
    - 3.5.3 Effective number of parameters
  - 3.6 Limitations of Fixed Basis Functions
  - Exercises
- 4. Linear Models for Classification
  - 4.1 Discriminant Functions
    - 4.1.1 Two classes
    - 4.1.2 Multiple classes
    - 4.1.3 Least squares for classification
    - 4.1.4 Fisher’s linear discriminant
    - 4.1.5 Relation to least squares
    - 4.1.6 Fisher’s discriminant for multiple classes
    - 4.1.7 The perceptron algorithm
  - 4.2 Probabilistic Generative Models
    - 4.2.1 Continuous inputs
    - 4.2.2 Maximum likelihood solution
    - 4.2.3 Discrete features
    - 4.2.4 Exponential family
  - 4.3 Probabilistic Discriminative Models
    - 4.3.1 Fixed basis functions
    - 4.3.2 Logistic regression
    - 4.3.3 Iterative reweighted least squares
    - 4.3.4 Multiclass logistic regression
    - 4.3.5 Probit regression
    - 4.3.6 Canonical link functions
  - 4.4 The Laplace Approximation
    - 4.4.1 Model comparison and BIC
  - 4.5 Bayesian Logistic Regression
    - 4.5.1 Laplace approximation
    - 4.5.2 Predictive distribution
  - Exercises
- 5. Neural Networks
  - 5.1 Feed-forward Network Functions
    - 5.1.1 Weight-space symmetries
  - 5.2 Network Training
    - 5.2.1 Parameter optimization
    - 5.2.2 Local quadratic approximation
    - 5.2.3 Use of gradient information
    - 5.2.4 Gradient descent optimization
  - 5.3 Error Backpropagation
    - 5.3.1 Evaluation of error-function derivatives
    - 5.3.2 A simple example
    - 5.3.3 Efficiency of backpropagation
    - 5.3.4 The Jacobian matrix
  - 5.4 The Hessian Matrix
    - 5.4.1 Diagonal approximation
    - 5.4.2 Outer product approximation
    - 5.4.3 Inverse Hessian
    - 5.4.4 Finite differences
    - 5.4.5 Exact evaluation of the Hessian
    - 5.4.6 Fast multiplication by the Hessian
  - 5.5 Regularization in Neural Networks
    - 5.5.1 Consistent Gaussian priors
    - 5.5.2 Early stopping
    - 5.5.3 Invariances
    - 5.5.4 Tangent propagation
    - 5.5.5 Training with transformed data
    - 5.5.6 Convolutional networks
    - 5.5.7 Soft weight sharing
  - 5.6 Mixture Density Networks
  - 5.7 Bayesian Neural Networks
    - 5.7.1 Posterior parameter distribution
    - 5.7.2 Hyperparameter optimization
    - 5.7.3 Bayesian neural networks for classification
  - Exercises
- 6. Kernel Methods
  - 6.1 Dual Representations
  - 6.2 Constructing Kernels
  - 6.3 Radial Basis Function Networks
    - 6.3.1 Nadaraya-Watson model
  - 6.4 Gaussian Processes
    - 6.4.1 Linear regression revisited
    - 6.4.2 Gaussian processes for regression
    - 6.4.3 Learning the hyperparameters
    - 6.4.4 Automatic relevance determination
    - 6.4.5 Gaussian processes for classification
    - 6.4.6 Laplace approximation
    - 6.4.7 Connection to neural networks
  - Exercises
- 7. Sparse Kernel Machines
  - 7.1 Maximum Margin Classifiers
    - 7.1.1 Overlapping class distributions
    - 7.1.2 Relation to logistic regression
    - 7.1.3 Multiclass SVMs
    - 7.1.4 SVMs for regression
    - 7.1.5 Computational learning theory
  - 7.2 Relevance Vector Machines
    - 7.2.1 RVM for regression
    - 7.2.2 Analysis of sparsity
    - 7.2.3 RVM for classification
- 8. Graphical Models
  - 8.1 Bayesian Networks
    - 8.1.1 Example: Polynomial regression
    - 8.1.2 Generative models
    - 8.1.3 Discrete variables
    - 8.1.4 Linear-Gaussian models
  - 8.2 Conditional Independence
    - 8.2.1 Three example graphs
    - 8.2.2 D-separation
  - 8.3 Markov Random Fields
    - 8.3.1 Conditional independence properties
    - 8.3.2 Factorization properties
    - 8.3.3 Illustration: Image de-noising
    - 8.3.4 Relation to directed graphs
  - 8.4 Inference in Graphical Models
    - 8.4.1 Inference on a chain
    - 8.4.2 Trees
    - 8.4.3 Factor graphs
    - 8.4.4 The sum-product algorithm
    - 8.4.5 The max-sum algorithm
    - 8.4.6 Exact inference in general graphs
    - 8.4.7 Loopy belief propagation
    - 8.4.8 Learning the graph structure
  - Exercises
- 9. Mixture Models and EM
  - 9.1 K-means Clustering
    - 9.1.1 Image segmentation and compression
  - 9.2 Mixtures of Gaussians
    - 9.2.1 Maximum likelihood
    - 9.2.2 EM for Gaussian mixtures
  - 9.3 An Alternative View of EM
    - 9.3.1 Gaussian mixtures revisited
    - 9.3.2 Relation to K-means
    - 9.3.3 Mixtures of Bernoulli distributions
    - 9.3.4 EM for Bayesian linear regression
  - 9.4 The EM Algorithm in General
  - Exercises
- 10. Approximate Inference
  - 10.1 Variational Inference
    - 10.1.1 Factorized distributions
    - 10.1.2 Properties of factorized approximations
    - 10.1.3 Example: The univariate Gaussian
    - 10.1.4 Model comparison
  - 10.2 Illustration: Variational Mixture of Gaussians
    - 10.2.1 Variational distribution
    - 10.2.2 Variational lower bound
    - 10.2.3 Predictive density
    - 10.2.4 Determining the number of components
    - 10.2.5 Induced factorizations
  - 10.3 Variational Linear Regression
    - 10.3.1 Variational distribution
    - 10.3.2 Predictive distribution
    - 10.3.3 Lower bound
  - 10.4 Exponential Family Distributions
    - 10.4.1 Variational message passing
  - 10.5 Local Variational Methods
  - 10.6 Variational Logistic Regression
    - 10.6.1 Variational posterior distribution
    - 10.6.2 Optimizing the variational parameters
    - 10.6.3 Inference of hyperparameters
  - 10.7 Expectation Propagation
    - 10.7.1 Example: The clutter problem
    - 10.7.2 Expectation propagation on graphs
  - Exercises
- 11. Sampling Methods
  - 11.1 Basic Sampling Algorithms
    - 11.1.1 Standard distributions
    - 11.1.2 Rejection sampling
    - 11.1.3 Adaptive rejection sampling
    - 11.1.4 Importance sampling
    - 11.1.5 Sampling-importance-resampling
    - 11.1.6 Sampling and the EM algorithm
  - 11.2 Markov Chain Monte Carlo
    - 11.2.1 Markov chains
    - 11.2.2 The Metropolis-Hastings algorithm
  - 11.3 Gibbs Sampling
  - 11.4 Slice Sampling
  - 11.5 The Hybrid Monte Carlo Algorithm
    - 11.5.1 Dynamical systems
    - 11.5.2 Hybrid Monte Carlo
  - 11.6 Estimating the Partition Function
  - Exercises
- 12. Continuous Latent Variables
  - 12.1 Principal Component Analysis
    - 12.1.1 Maximum variance formulation
    - 12.1.2 Minimum-error formulation
    - 12.1.3 Applications of PCA
    - 12.1.4 PCA for high-dimensional data
  - 12.2 Probabilistic PCA
    - 12.2.1 Maximum likelihood PCA
    - 12.2.2 EM algorithm for PCA
    - 12.2.3 Bayesian PCA
    - 12.2.4 Factor analysis
  - 12.3 Kernel PCA
  - 12.4 Nonlinear Latent Variable Models
    - 12.4.1 Independent component analysis
    - 12.4.2 Autoassociative neural networks
    - 12.4.3 Modelling nonlinear manifolds
  - Exercises
- 13. Sequential Data
  - 13.1 Markov Models
  - 13.2 Hidden Markov Models
    - 13.2.1 Maximum likelihood for the HMM
    - 13.2.2 The forward-backward algorithm
    - 13.2.3 The sum-product algorithm for the HMM
    - 13.2.4 Scaling factors
    - 13.2.5 The Viterbi algorithm
    - 13.2.6 Extensions of the hidden Markov model
  - 13.3 Linear Dynamical Systems
    - 13.3.1 Inference in LDS
    - 13.3.2 Learning in LDS
    - 13.3.3 Extensions of LDS
    - 13.3.4 Particle filters
  - Exercises
- 14. Combining Models
  - 14.1 Bayesian Model Averaging
  - 14.2 Committees
  - 14.3 Boosting
    - 14.3.1 Minimizing exponential error
    - 14.3.2 Error functions for boosting
  - 14.4 Tree-based Models
  - 14.5 Conditional Mixture Models
    - 14.5.1 Mixtures of linear regression models
    - 14.5.2 Mixtures of logistic models
    - 14.5.3 Mixtures of experts
  - Exercises
- Appendix A. Data Sets
- Appendix B. Probability Distributions
- Appendix C. Properties of Matrices
- Appendix D. Calculus of Variations
- Appendix E. Lagrange Multipliers
- References
- Index