This book is based on my five-day course, which I had the pleasure of teaching in the following Spanish cities: A Coruña, Algeciras, Alicante, Bilbao, Cáceres, Granada, Huesca, Jaén, Madrid, Málaga, Murcia, Sevilla, Valencia, Valladolid, and Zaragoza. I would like to take this opportunity to say a big thank you to all of my students: ¡un gran placer! (a great pleasure!)
This book uses the Python programming language, largely in conjunction with the scikit-learn machine learning library and pandas for data manipulation. All example notebooks use the Jupyter Notebook environment.
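As a minimal sketch of that workflow (the dataframe and column names below are made up for illustration, not taken from the book's notebooks):

```python
# Toy end-to-end example: a pandas dataframe in, a scikit-learn model out.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data, purely for demonstration
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "target":  [2.1, 3.9, 6.2, 8.1, 9.8]})

# Split into training and test sets, fit a model, and predict
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature"]], df["target"], test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))
```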
Conditions of Use
This book is licensed under a Creative Commons Attribution (CC BY) license. You can download the ebook The Orange Book of Machine Learning for free.
- Title: The Orange Book of Machine Learning
- Subtitle: The essentials of making predictions using supervised regression and classification for tabular data
- Publisher: Leanpub
- Author(s): Carl McBride Ellis
- Published: 2024-07-05
- Edition: 1
- Format: eBook (PDF, EPUB, MOBI)
- Pages: 135
- Language: English
- License: CC BY
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Contents

1 Introduction
   1.1 The X and the y
   1.2 Interpolation and curve fitting
   1.3 Errors and residuals
   1.4 Sources of uncertainty: aleatoric and epistemic
   1.5 Confidence and prediction intervals
   1.6 Explainability and interpretability
2 Statistics
   2.1 Centrality: Mean, median, and mode
   2.2 Dispersion: Variance, MAD, and quartiles
      2.2.1 Quantiles, quartiles and the interquartile range (IQR)
   2.3 Gaussian distribution: additive
      2.3.1 Tests for normality
   2.4 Chebyshev's inequality
   2.5 Galton distribution: multiplicative
   2.6 Skewness and kurtosis
3 Exploratory data analysis (EDA)
   3.1 Data quality
   3.2 Getting to know your dataframe
      3.2.1 The curse of dimensionality
      3.2.2 Descriptive statistics
   3.3 Anscombe's quartet
   3.4 Box, violin and raincloud plots
   3.5 Outliers, inliers and extreme values
   3.6 Correlation coefficients
      3.6.1 Mutual information (MI)
   3.7 Scatter plot
   3.8 Histograms and eCDF
      3.8.1 Kolmogorov-Smirnov test
   3.9 Pairplots (or not)
4 Data cleaning
   4.1 Missing values: NULL and NaN
      4.1.1 Visualization of NaN with missingno
      4.1.2 MCAR, MAR, and MNAR
      4.1.3 Global fill
      4.1.4 Global delete
      4.1.5 Average value imputation
      4.1.6 Multiple imputation
      4.1.7 Do nothing!
      4.1.8 Binary indicator column
   4.2 Outliers and inliers
      4.2.1 Outliers
      4.2.2 Inliers: Isolation forest
   4.3 Duplicated rows
   4.4 Boolean columns
   4.5 Zero variance columns
   4.6 Feature scaling: standardization and normalization
   4.7 Categorical features
      4.7.1 Ordinal
      4.7.2 Nominal
5 Cross-validation
   5.1 Train test split
   5.2 Cross-validation
   5.3 Nested cross-validation
   5.4 Data leakage
   5.5 Covariate shift and concept drift
6 Regression
   6.1 Regression baseline model
   6.2 Univariate linear regression
   6.3 Calculating θ₁ and θ₀
      6.3.1 Ordinary least squares
      6.3.2 Normal equation
      6.3.3 Scikit-learn LinearRegression
   6.4 Assumptions of linear regression
   6.5 Polynomial regression
   6.6 Extrapolation
      6.6.1 Convex hull
   6.7 Explainability
   6.8 The loss and cost functions
      6.8.1 Gradient descent
   6.9 Metrics
      6.9.1 Root mean square error (RMSE)
      6.9.2 Mean absolute error (MAE)
      6.9.3 The R² metric
   6.10 Decision tree regressor
      6.10.1 Hyperparameter: max_depth
   6.11 Overfitting
      6.11.1 Parametric models: regularization
      6.11.2 Tree models: min_samples_leaf
   6.12 Quantile regression
      6.12.1 Pinball loss function
   6.13 Conformal prediction intervals
      6.13.1 Conformalized quantile regression (CQR)
      6.13.2 Locally-weighted conformal regression
      6.13.3 Prediction interval metric: Winkler interval score
   6.14 Summary
7 Classification
   7.1 Logistic regression
      7.1.1 Explainability
   7.2 Log-loss function
   7.3 Decision tree classifier
   7.4 Classification baseline model
   7.5 Classification metrics
      7.5.1 Strictly proper scoring rules
      7.5.2 Accuracy score
      7.5.3 Confusion matrix
      7.5.4 Precision and recall
      7.5.5 Decision threshold
      7.5.6 AUC ROC
   7.6 Imbalanced classification
      7.6.1 What to do about imbalanced data?
   7.7 Overfitting
   7.8 No free lunch theorem
   7.9 Classifier calibration
      7.9.1 Reliability diagrams
      7.9.2 Venn-ABERS calibration
   7.10 Multiclass classification
      7.10.1 Multiclass metrics
8 Ensemble estimators
   8.1 Random Forest
      8.1.1 Bootstrapping: row subsampling with replacement
      8.1.2 Feature subsampling
      8.1.3 Results
   8.2 Weak learners and boosting
      8.2.1 AdaBoost (Adaptive Boosting)
   8.3 Gradient boosted decision trees (GBDT)
      8.3.1 Extrapolation
   8.4 Convex combination of model predictions (CCMP)
   8.5 Stacking
9 Hyperparameter optimization
10 Feature engineering and selection
   10.1 Feature engineering
      10.1.1 Interaction and cross features
      10.1.2 Bucketing of continuous features
      10.1.3 Power transforms: Yeo-Johnson
      10.1.4 User defined transform
      10.1.5 External secondary features
   10.2 Feature selection
      10.2.1 Correlation
      10.2.2 Permutation importance
      10.2.3 Stepwise regression
      10.2.4 LASSO
      10.2.5 Boruta trick
      10.2.6 Native feature importance plots
   10.3 Principal component analysis (PCA)
11 Why no neural networks/deep learning?
      11.0.1 Single neuron regressor
      11.0.2 Single neuron classifier
Essential reading