A new edition of a graduate-level machine learning textbook that focuses on the analysis and theory of algorithms.
This book is a general introduction to machine learning that can serve as a textbook for graduate students and a reference for researchers. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.
Foundations of Machine Learning is unique in its focus on the analysis and theory of algorithms. The first four chapters lay the theoretical foundation for what follows; subsequent chapters are mostly self-contained. Topics covered include the Probably Approximately Correct (PAC) learning framework; generalization bounds based on Rademacher complexity and VC-dimension; Support Vector Machines (SVMs); kernel methods; boosting; on-line learning; multi-class classification; ranking; regression; algorithmic stability; dimensionality reduction; learning automata and languages; and reinforcement learning. Each chapter ends with a set of exercises. The appendixes provide additional material, including a concise probability review.
This second edition offers three new chapters, on model selection, maximum entropy models, and conditional maximum entropy models. New material in the appendixes includes a major section on Fenchel duality, expanded coverage of concentration inequalities, and a new appendix on notions of information theory.
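As a representative example of the guarantees developed in the book, the finite, consistent case of the PAC framework (section 2.2) gives the classic learning bound: for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, every hypothesis $h$ in a finite set $H$ that is consistent with the sample satisfies

$$
R(h) \;\le\; \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right),
$$

where $R(h)$ denotes the generalization error of $h$.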
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Foundations of Machine Learning, 2nd Edition for free.
- Title
- Foundations of Machine Learning, 2nd Edition
- Publisher
- The MIT Press
- Author(s)
- Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
- Published
- 2019-04-02
- Edition
- 2
- Format
- eBook (pdf, epub, mobi)
- Pages
- 504
- Language
- English
- ISBN-10
- 0262039400
- ISBN-13
- 9780262039406
- License
- CC BY-NC-SA
- Book Homepage
- Free eBook, Errata, Code, Solutions, etc.
Contents

- Preface
- 1 Introduction: 1.1 What is machine learning? 1.2 What kind of problems can be tackled using machine learning? 1.3 Some standard learning tasks 1.4 Learning stages 1.5 Learning scenarios 1.6 Generalization
- 2 The PAC Learning Framework: 2.1 The PAC learning model 2.2 Guarantees for finite hypothesis sets — consistent case 2.3 Guarantees for finite hypothesis sets — inconsistent case 2.4 Generalities 2.4.1 Deterministic versus stochastic scenarios 2.4.2 Bayes error and noise 2.5 Chapter notes 2.6 Exercises
- 3 Rademacher Complexity and VC-Dimension: 3.1 Rademacher complexity 3.2 Growth function 3.3 VC-dimension 3.4 Lower bounds 3.5 Chapter notes 3.6 Exercises
- 4 Model Selection: 4.1 Estimation and approximation errors 4.2 Empirical risk minimization (ERM) 4.3 Structural risk minimization (SRM) 4.4 Cross-validation 4.5 n-Fold cross-validation 4.6 Regularization-based algorithms 4.7 Convex surrogate losses 4.8 Chapter notes 4.9 Exercises
- 5 Support Vector Machines: 5.1 Linear classification 5.2 Separable case 5.2.1 Primal optimization problem 5.2.2 Support vectors 5.2.3 Dual optimization problem 5.2.4 Leave-one-out analysis 5.3 Non-separable case 5.3.1 Primal optimization problem 5.3.2 Support vectors 5.3.3 Dual optimization problem 5.4 Margin theory 5.5 Chapter notes 5.6 Exercises
- 6 Kernel Methods: 6.1 Introduction 6.2 Positive definite symmetric kernels 6.2.1 Definitions 6.2.2 Reproducing kernel Hilbert space 6.2.3 Properties 6.3 Kernel-based algorithms 6.3.1 SVMs with PDS kernels 6.3.2 Representer theorem 6.3.3 Learning guarantees 6.4 Negative definite symmetric kernels 6.5 Sequence kernels 6.5.1 Weighted transducers 6.5.2 Rational kernels 6.6 Approximate kernel feature maps 6.7 Chapter notes 6.8 Exercises
- 7 Boosting: 7.1 Introduction 7.2 AdaBoost 7.2.1 Bound on the empirical error 7.2.2 Relationship with coordinate descent 7.2.3 Practical use 7.3 Theoretical results 7.3.1 VC-dimension-based analysis 7.3.2 L1-geometric margin 7.3.3 Margin-based analysis 7.3.4 Margin maximization 7.3.5 Game-theoretic interpretation 7.4 L1-regularization 7.5 Discussion 7.6 Chapter notes 7.7 Exercises
- 8 On-Line Learning: 8.1 Introduction 8.2 Prediction with expert advice 8.2.1 Mistake bounds and Halving algorithm 8.2.2 Weighted majority algorithm 8.2.3 Randomized weighted majority algorithm 8.2.4 Exponential weighted average algorithm 8.3 Linear classification 8.3.1 Perceptron algorithm 8.3.2 Winnow algorithm 8.4 On-line to batch conversion 8.5 Game-theoretic connection 8.6 Chapter notes 8.7 Exercises
- 9 Multi-Class Classification: 9.1 Multi-class classification problem 9.2 Generalization bounds 9.3 Uncombined multi-class algorithms 9.3.1 Multi-class SVMs 9.3.2 Multi-class boosting algorithms 9.3.3 Decision trees 9.4 Aggregated multi-class algorithms 9.4.1 One-versus-all 9.4.2 One-versus-one 9.4.3 Error-correcting output codes 9.5 Structured prediction algorithms 9.6 Chapter notes 9.7 Exercises
- 10 Ranking: 10.1 The problem of ranking 10.2 Generalization bound 10.3 Ranking with SVMs 10.4 RankBoost 10.4.1 Bound on the empirical error 10.4.2 Relationship with coordinate descent 10.4.3 Margin bound for ensemble methods in ranking 10.5 Bipartite ranking 10.5.1 Boosting in bipartite ranking 10.5.2 Area under the ROC curve 10.6 Preference-based setting 10.6.1 Second-stage ranking problem 10.6.2 Deterministic algorithm 10.6.3 Randomized algorithm 10.6.4 Extension to other loss functions 10.7 Other ranking criteria 10.8 Chapter notes 10.9 Exercises
- 11 Regression: 11.1 The problem of regression 11.2 Generalization bounds 11.2.1 Finite hypothesis sets 11.2.2 Rademacher complexity bounds 11.2.3 Pseudo-dimension bounds 11.3 Regression algorithms 11.3.1 Linear regression 11.3.2 Kernel ridge regression 11.3.3 Support vector regression 11.3.4 Lasso 11.3.5 Group norm regression algorithms 11.3.6 On-line regression algorithms 11.4 Chapter notes 11.5 Exercises
- 12 Maximum Entropy Models: 12.1 Density estimation problem 12.1.1 Maximum Likelihood (ML) solution 12.1.2 Maximum a Posteriori (MAP) solution 12.2 Density estimation problem augmented with features 12.3 Maxent principle 12.4 Maxent models 12.5 Dual problem 12.6 Generalization bound 12.7 Coordinate descent algorithm 12.8 Extensions 12.9 L2-regularization 12.10 Chapter notes 12.11 Exercises
- 13 Conditional Maximum Entropy Models: 13.1 Learning problem 13.2 Conditional Maxent principle 13.3 Conditional Maxent models 13.4 Dual problem 13.5 Properties 13.5.1 Optimization problem 13.5.2 Feature vectors 13.5.3 Prediction 13.6 Generalization bounds 13.7 Logistic regression 13.7.1 Optimization problem 13.7.2 Logistic model 13.8 L2-regularization 13.9 Proof of the duality theorem 13.10 Chapter notes 13.11 Exercises
- 14 Algorithmic Stability: 14.1 Definitions 14.2 Stability-based generalization guarantee 14.3 Stability of kernel-based regularization algorithms 14.3.1 Application to regression algorithms: SVR and KRR 14.3.2 Application to classification algorithms: SVMs 14.3.3 Discussion 14.4 Chapter notes 14.5 Exercises
- 15 Dimensionality Reduction: 15.1 Principal component analysis 15.2 Kernel principal component analysis (KPCA) 15.3 KPCA and manifold learning 15.3.1 Isomap 15.3.2 Laplacian eigenmaps 15.3.3 Locally linear embedding (LLE) 15.4 Johnson-Lindenstrauss lemma 15.5 Chapter notes 15.6 Exercises
- 16 Learning Automata and Languages: 16.1 Introduction 16.2 Finite automata 16.3 Efficient exact learning 16.3.1 Passive learning 16.3.2 Learning with queries 16.3.3 Learning automata with queries 16.4 Identification in the limit 16.4.1 Learning reversible automata 16.5 Chapter notes 16.6 Exercises
- 17 Reinforcement Learning: 17.1 Learning scenario 17.2 Markov decision process model 17.3 Policy 17.3.1 Definition 17.3.2 Policy value 17.3.3 Optimal policies 17.3.4 Policy evaluation 17.4 Planning algorithms 17.4.1 Value iteration 17.4.2 Policy iteration 17.4.3 Linear programming 17.5 Learning algorithms 17.5.1 Stochastic approximation 17.5.2 TD(0) algorithm 17.5.3 Q-learning algorithm 17.5.4 SARSA 17.5.5 TD(λ) algorithm 17.5.6 Large state space 17.6 Chapter notes
- Conclusion
- A Linear Algebra Review: A.1 Vectors and norms A.1.1 Norms A.1.2 Dual norms A.1.3 Relationship between norms A.2 Matrices A.2.1 Matrix norms A.2.2 Singular value decomposition A.2.3 Symmetric positive semidefinite (SPSD) matrices
- B Convex Optimization: B.1 Differentiation and unconstrained optimization B.2 Convexity B.3 Constrained optimization B.4 Fenchel duality B.4.1 Subgradients B.4.2 Core B.4.3 Conjugate functions B.5 Chapter notes B.6 Exercises
- C Probability Review: C.1 Probability C.2 Random variables C.3 Conditional probability and independence C.4 Expectation and Markov’s inequality C.5 Variance and Chebyshev’s inequality C.6 Moment-generating functions C.7 Exercises
- D Concentration Inequalities: D.1 Hoeffding’s inequality D.2 Sanov’s theorem D.3 Multiplicative Chernoff bounds D.4 Binomial distribution tails: Upper bounds D.5 Binomial distribution tails: Lower bound D.6 Azuma’s inequality D.7 McDiarmid’s inequality D.8 Normal distribution tails: Lower bound D.9 Khintchine-Kahane inequality D.10 Maximal inequality D.11 Chapter notes D.12 Exercises
- E Notions of Information Theory: E.1 Entropy E.2 Relative entropy E.3 Mutual information E.4 Bregman divergences E.5 Chapter notes E.6 Exercises
- F Notation
- Bibliography
- Index
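For a taste of the algorithms analyzed in the book, below is a minimal Python sketch of the Perceptron algorithm from section 8.3.1. The function name, epoch count, and toy data are illustrative choices, not the book's own pseudocode.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Minimal Perceptron (cf. section 8.3.1): learn a linear separator w
    from examples X (n x d) with labels y in {-1, +1} by updating w on
    each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # misclassified: update w
                w += y_i * x_i
    return w

# Tiny usage example on a linearly separable toy set.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # matches y: [ 1.  1. -1. -1.]
```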