A new edition of a graduate-level machine learning textbook that focuses on the analysis and theory of algorithms.
This book is a general introduction to machine learning that can serve as a textbook for graduate students and a reference for researchers. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.
Foundations of Machine Learning is unique in its focus on the analysis and theory of algorithms. The first four chapters lay the theoretical foundation for what follows; subsequent chapters are mostly self-contained. Topics covered include the Probably Approximately Correct (PAC) learning framework; generalization bounds based on Rademacher complexity and VC-dimension; Support Vector Machines (SVMs); kernel methods; boosting; on-line learning; multi-class classification; ranking; regression; algorithmic stability; dimensionality reduction; learning automata and languages; and reinforcement learning. Each chapter ends with a set of exercises. The appendixes provide additional material, including a concise probability review.
This second edition offers three new chapters, on model selection, maximum entropy models, and conditional maximum entropy models. New material in the appendixes includes a major section on Fenchel duality, expanded coverage of concentration inequalities, and a new appendix on notions of information theory.
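As a representative example of the guarantees developed in the book, the finite, consistent case of the PAC framework (section 2.2) gives the classic learning bound: for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, every hypothesis $h$ in a finite set $H$ that is consistent with the sample satisfies

$$
R(h) \;\le\; \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right),
$$

where $R(h)$ denotes the generalization error of $h$.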
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Foundations of Machine Learning, 2nd Edition for free.
- Title
- Foundations of Machine Learning, 2nd Edition
- Publisher
- The MIT Press
- Author(s)
- Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
- Published
- 2019-04-02
- Edition
- 2
- Format
- eBook (pdf, epub, mobi)
- Pages
- 504
- Language
- English
- ISBN-10
- 0262039400
- ISBN-13
- 9780262039406
- License
- CC BY-NC-SA
- Book Homepage
- Free eBook, Errata, Code, Solutions, etc.
Contents

- Preface
- 1 Introduction: 1.1 What is machine learning? 1.2 What kind of problems can be tackled using machine learning? 1.3 Some standard learning tasks 1.4 Learning stages 1.5 Learning scenarios 1.6 Generalization
- 2 The PAC Learning Framework: 2.1 The PAC learning model 2.2 Guarantees for finite hypothesis sets — consistent case 2.3 Guarantees for finite hypothesis sets — inconsistent case 2.4 Generalities 2.4.1 Deterministic versus stochastic scenarios 2.4.2 Bayes error and noise 2.5 Chapter notes 2.6 Exercises
- 3 Rademacher Complexity and VC-Dimension: 3.1 Rademacher complexity 3.2 Growth function 3.3 VC-dimension 3.4 Lower bounds 3.5 Chapter notes 3.6 Exercises
- 4 Model Selection: 4.1 Estimation and approximation errors 4.2 Empirical risk minimization (ERM) 4.3 Structural risk minimization (SRM) 4.4 Cross-validation 4.5 n-Fold cross-validation 4.6 Regularization-based algorithms 4.7 Convex surrogate losses 4.8 Chapter notes 4.9 Exercises
- 5 Support Vector Machines: 5.1 Linear classification 5.2 Separable case 5.2.1 Primal optimization problem 5.2.2 Support vectors 5.2.3 Dual optimization problem 5.2.4 Leave-one-out analysis 5.3 Non-separable case 5.3.1 Primal optimization problem 5.3.2 Support vectors 5.3.3 Dual optimization problem 5.4 Margin theory 5.5 Chapter notes 5.6 Exercises
- 6 Kernel Methods: 6.1 Introduction 6.2 Positive definite symmetric kernels 6.2.1 Definitions 6.2.2 Reproducing kernel Hilbert space 6.2.3 Properties 6.3 Kernel-based algorithms 6.3.1 SVMs with PDS kernels 6.3.2 Representer theorem 6.3.3 Learning guarantees 6.4 Negative definite symmetric kernels 6.5 Sequence kernels 6.5.1 Weighted transducers 6.5.2 Rational kernels 6.6 Approximate kernel feature maps 6.7 Chapter notes 6.8 Exercises
- 7 Boosting: 7.1 Introduction 7.2 AdaBoost 7.2.1 Bound on the empirical error 7.2.2 Relationship with coordinate descent 7.2.3 Practical use 7.3 Theoretical results 7.3.1 VC-dimension-based analysis 7.3.2 L1-geometric margin 7.3.3 Margin-based analysis 7.3.4 Margin maximization 7.3.5 Game-theoretic interpretation 7.4 L1-regularization 7.5 Discussion 7.6 Chapter notes 7.7 Exercises
- 8 On-Line Learning: 8.1 Introduction 8.2 Prediction with expert advice 8.2.1 Mistake bounds and Halving algorithm 8.2.2 Weighted majority algorithm 8.2.3 Randomized weighted majority algorithm 8.2.4 Exponential weighted average algorithm 8.3 Linear classification 8.3.1 Perceptron algorithm 8.3.2 Winnow algorithm 8.4 On-line to batch conversion 8.5 Game-theoretic connection 8.6 Chapter notes 8.7 Exercises
- 9 Multi-Class Classification: 9.1 Multi-class classification problem 9.2 Generalization bounds 9.3 Uncombined multi-class algorithms 9.3.1 Multi-class SVMs 9.3.2 Multi-class boosting algorithms 9.3.3 Decision trees 9.4 Aggregated multi-class algorithms 9.4.1 One-versus-all 9.4.2 One-versus-one 9.4.3 Error-correcting output codes 9.5 Structured prediction algorithms 9.6 Chapter notes 9.7 Exercises
- 10 Ranking: 10.1 The problem of ranking 10.2 Generalization bound 10.3 Ranking with SVMs 10.4 RankBoost 10.4.1 Bound on the empirical error 10.4.2 Relationship with coordinate descent 10.4.3 Margin bound for ensemble methods in ranking 10.5 Bipartite ranking 10.5.1 Boosting in bipartite ranking 10.5.2 Area under the ROC curve 10.6 Preference-based setting 10.6.1 Second-stage ranking problem 10.6.2 Deterministic algorithm 10.6.3 Randomized algorithm 10.6.4 Extension to other loss functions 10.7 Other ranking criteria 10.8 Chapter notes 10.9 Exercises
- 11 Regression: 11.1 The problem of regression 11.2 Generalization bounds 11.2.1 Finite hypothesis sets 11.2.2 Rademacher complexity bounds 11.2.3 Pseudo-dimension bounds 11.3 Regression algorithms 11.3.1 Linear regression 11.3.2 Kernel ridge regression 11.3.3 Support vector regression 11.3.4 Lasso 11.3.5 Group norm regression algorithms 11.3.6 On-line regression algorithms 11.4 Chapter notes 11.5 Exercises
- 12 Maximum Entropy Models: 12.1 Density estimation problem 12.1.1 Maximum Likelihood (ML) solution 12.1.2 Maximum a Posteriori (MAP) solution 12.2 Density estimation problem augmented with features 12.3 Maxent principle 12.4 Maxent models 12.5 Dual problem 12.6 Generalization bound 12.7 Coordinate descent algorithm 12.8 Extensions 12.9 L2-regularization 12.10 Chapter notes 12.11 Exercises
- 13 Conditional Maximum Entropy Models: 13.1 Learning problem 13.2 Conditional Maxent principle 13.3 Conditional Maxent models 13.4 Dual problem 13.5 Properties 13.5.1 Optimization problem 13.5.2 Feature vectors 13.5.3 Prediction 13.6 Generalization bounds 13.7 Logistic regression 13.7.1 Optimization problem 13.7.2 Logistic model 13.8 L2-regularization 13.9 Proof of the duality theorem 13.10 Chapter notes 13.11 Exercises
- 14 Algorithmic Stability: 14.1 Definitions 14.2 Stability-based generalization guarantee 14.3 Stability of kernel-based regularization algorithms 14.3.1 Application to regression algorithms: SVR and KRR 14.3.2 Application to classification algorithms: SVMs 14.3.3 Discussion 14.4 Chapter notes 14.5 Exercises
- 15 Dimensionality Reduction: 15.1 Principal component analysis 15.2 Kernel principal component analysis (KPCA) 15.3 KPCA and manifold learning 15.3.1 Isomap 15.3.2 Laplacian eigenmaps 15.3.3 Locally linear embedding (LLE) 15.4 Johnson-Lindenstrauss lemma 15.5 Chapter notes 15.6 Exercises
- 16 Learning Automata and Languages: 16.1 Introduction 16.2 Finite automata 16.3 Efficient exact learning 16.3.1 Passive learning 16.3.2 Learning with queries 16.3.3 Learning automata with queries 16.4 Identification in the limit 16.4.1 Learning reversible automata 16.5 Chapter notes 16.6 Exercises
- 17 Reinforcement Learning: 17.1 Learning scenario 17.2 Markov decision process model 17.3 Policy 17.3.1 Definition 17.3.2 Policy value 17.3.3 Optimal policies 17.3.4 Policy evaluation 17.4 Planning algorithms 17.4.1 Value iteration 17.4.2 Policy iteration 17.4.3 Linear programming 17.5 Learning algorithms 17.5.1 Stochastic approximation 17.5.2 TD(0) algorithm 17.5.3 Q-learning algorithm 17.5.4 SARSA 17.5.5 TD(λ) algorithm 17.5.6 Large state space 17.6 Chapter notes
- Conclusion
- A Linear Algebra Review: A.1 Vectors and norms A.1.1 Norms A.1.2 Dual norms A.1.3 Relationship between norms A.2 Matrices A.2.1 Matrix norms A.2.2 Singular value decomposition A.2.3 Symmetric positive semidefinite (SPSD) matrices
- B Convex Optimization: B.1 Differentiation and unconstrained optimization B.2 Convexity B.3 Constrained optimization B.4 Fenchel duality B.4.1 Subgradients B.4.2 Core B.4.3 Conjugate functions B.5 Chapter notes B.6 Exercises
- C Probability Review: C.1 Probability C.2 Random variables C.3 Conditional probability and independence C.4 Expectation and Markov’s inequality C.5 Variance and Chebyshev’s inequality C.6 Moment-generating functions C.7 Exercises
- D Concentration Inequalities: D.1 Hoeffding’s inequality D.2 Sanov’s theorem D.3 Multiplicative Chernoff bounds D.4 Binomial distribution tails: Upper bounds D.5 Binomial distribution tails: Lower bound D.6 Azuma’s inequality D.7 McDiarmid’s inequality D.8 Normal distribution tails: Lower bound D.9 Khintchine-Kahane inequality D.10 Maximal inequality D.11 Chapter notes D.12 Exercises
- E Notions of Information Theory: E.1 Entropy E.2 Relative entropy E.3 Mutual information E.4 Bregman divergences E.5 Chapter notes E.6 Exercises
- F Notation
- Bibliography
- Index
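For a taste of the algorithms analyzed in the book, below is a minimal Python sketch of the Perceptron algorithm from section 8.3.1. The function name, epoch count, and toy data are illustrative choices, not the book's own pseudocode.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Minimal Perceptron (cf. section 8.3.1): learn a linear separator w
    from examples X (n x d) with labels y in {-1, +1} by updating w on
    each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # misclassified: update w
                w += y_i * x_i
    return w

# Tiny usage example on a linearly separable toy set.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # matches y: [ 1.  1. -1. -1.]
```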