This book illustrates the fundamental concepts that link statistics and machine learning, so that the reader can not only employ statistical and machine learning models using modern Python modules, but also understand their relative strengths and weaknesses.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Statistics and Machine Learning in Python for free.
- Title: Statistics and Machine Learning in Python
- Author(s): Edouard Duchesnay, Feki Younes, Tommy Löfstedt
- Published: 2024-04-15
- Edition: 1
- Format: eBook (PDF, ePub, Mobi)
- Pages: 399
- Language: English
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents

Introduction
- Introduction to Python language: Python main features; Development process; Python ecosystem for data science
- Development with an Integrated Development Environment (IDE) and JupyterLab: Visual Studio Code (VS Code); Spyder; JupyterLab (Jupyter Notebook)
- Anaconda and Conda environments: Installation; Conda environments; Miniconda; Additional packages with pip

Python language
- Import libraries; Basic operations
- Data types: Lists; Tuples; Strings; Dictionaries; Sets
- Execution control statements: Conditional statements; Loops; Example: use a loop, a dictionary and a set to count the words in a sentence
- List comprehensions, iterators, etc.: List comprehensions; Set comprehension; Dictionary comprehension; Iterators; The itertools package
- Exceptions handling
- Functions: Reference and copy; Example: function and dictionary comprehension
- Regular expression
- System programming: Operating system interfaces (os); File input/output; Explore, list directories; Command execution with subprocess; Multiprocessing and multithreading; Scripts and argument parsing
- Networking: FTP; HTTP; Sockets; xmlrpc
- Object Oriented Programming (OOP)
- Style guide for Python programming; Documenting
- Modules and packages: Package; Module; The search path
- Unit testing: unittest: test your code; Doctest: add unit tests in docstrings
- Exercises: 1: functions; 2: functions + list + loop; 3: file I/O; 4: OOP

Scientific Python
- Numpy, arrays and matrices: Create arrays; Examining arrays; Reshaping; Summary on axis, reshaping/flattening and selection; Stack arrays; Selection; Slicing; Fancy indexing: integer or boolean array indexing; Vectorized operations; Broadcasting rules; Exercises
- Pandas, data manipulation: Create DataFrame; Combining DataFrames: concatenate, join; Reshaping by pivoting; Summarizing; Column selection; Row selection (basic); Sorting; Row iteration; Row selection (filtering); Descriptive statistics; Quality check: remove duplicate data, missing data; Operation: multiplication; Renaming; Dealing with outliers: based on parametric statistics (use the mean) or non-parametric statistics (use the median); File I/O: CSV, CSV from URL, Excel, SQL (SQLite); Exercises: DataFrame, missing data
- Data visualization with matplotlib & seaborn: Basic plots; Scatter (2D) plots: simple scatter with colors, linear model, scatter plot with colors and symbols; Saving figures; Boxplot and violin plot: one factor, two factors; Distributions and density plot; Multiple axes; Pairwise scatter plots; Time series
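To give a flavor of the Scientific Python part, here is a minimal sketch (illustrative only, not excerpted from the book) of two of the topics listed above, numpy broadcasting and pandas group summaries:

```python
import numpy as np
import pandas as pd

# Vectorized operations and broadcasting: standardize the columns
# of a (4, 3) array; the (3,) mean and std broadcast over the rows.
X = np.random.default_rng(0).normal(size=(4, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately 0 for every column

# Pandas: build a small DataFrame and summarize it by group.
df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("group")["value"].mean())
```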
Statistics
- Univariate statistics: Libraries; Estimators of the main statistical measures: mean, variance, standard deviation, covariance, correlation, standard error (SE); Descriptive statistics with numpy; Descriptive statistics on the Iris dataset
- Main distributions: Normal distribution; The chi-square distribution; The Fisher's F-distribution; The Student's t-distribution
- Hypothesis testing: Flip coin, simplified example; Flip coin, real example; One-sample t-test: assumptions, (1) model the data, (2) fit: estimate the model parameters, (3) compute a test statistic, (4) compute the probability of the test statistic under the null hypothesis (this requires the distribution of the t statistic under H0), example
- Testing pairwise associations: Pearson correlation test (association between two quantitative variables); Two-sample (Student) t-test to compare two means: assumptions, (1) model the data, (2) fit: estimate the model parameters, (3) t-test; equal or unequal sample sizes with unequal variances (Welch's t-test); equal or unequal sample sizes with equal variances; equal sample sizes with equal variances; example; ANOVA F-test (quantitative ~ categorical with >= 2 levels): assumptions, (1) model the data, (2) fit: estimate the model parameters, (3) F-test; Chi-square test, χ² (categorical ~ categorical)
- Non-parametric tests of pairwise associations: Spearman rank-order correlation (quantitative ~ quantitative); Wilcoxon signed-rank test (quantitative ~ constant); Mann–Whitney U test (quantitative ~ categorical with 2 levels)
- Linear model: Assumptions; Simple regression (test the association between two quantitative variables): (1) model the data, (2) fit: estimate the model parameters; Multiple regression: theory, simulated dataset, fit with numpy
- Linear model with statsmodels: Multiple regression; Interface with statsmodels without formulae (sm); Statsmodels with Pandas using formulae (smf); Multiple regression with categorical independent variables or factors: analysis of covariance (ANCOVA); One-way AN(C)OVA; Two-way AN(C)OVA; Comparing two nested models; Factor coding; Contrasts and post-hoc tests
- Multiple comparisons: Bonferroni correction; The false discovery rate (FDR) correction
- Lab, brain volumes study: Manipulate data; Descriptive statistics; Statistics
- Linear Mixed Models: Introduction: clustered/structured datasets, mixed effects = fixed + random effects; Random intercept: global fixed effect, model a classroom intercept as a fixed effect (ANCOVA), aggregation of data into independent units, hierarchical/multilevel modeling, model the classroom random intercept (linear mixed model); Random slope: model the classroom intercept and slope as fixed effects (ANCOVA with interactions), model the classroom random intercept and slope with an LMM; Conclusion on modeling random effects; Theory of linear mixed models; Checking model assumptions (diagnostics); References
- Multivariate statistics: Linear algebra: Euclidean norm and distance, dot product and projection; Mean vector; Covariance matrix; Correlation matrix; Precision matrix; Mahalanobis distance; Multivariate normal distribution; Exercises: dot product and Euclidean norm, covariance matrix and Mahalanobis norm
- Time series in Python: Stationarity; Pandas time series data structure; Time series analysis of Google Trends: read data, recode data, exploratory data analysis; Resampling, smoothing, windowing, rolling average: trends; First-order differencing: seasonal patterns; Periodicity and correlation; Autocorrelation; Time series forecasting with autoregressive moving average (ARMA) models: choosing p and q, fitting an ARMA model with statsmodels
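The Statistics part works through tests like these with scipy and statsmodels. The following is a minimal sketch (illustrative only, not excerpted from the book) of a two-sample t-test and a simple regression with the formula interface:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Two-sample Student t-test: compare the means of two groups
# (equal variances assumed; pass equal_var=False for Welch's t-test).
g1 = rng.normal(loc=1.0, scale=1.0, size=30)
g2 = rng.normal(loc=0.5, scale=1.0, size=30)
tval, pval = stats.ttest_ind(g1, g2)
print(f"t = {tval:.3f}, p = {pval:.4f}")

# Simple regression with the statsmodels formula interface (smf).
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
model = smf.ols("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit()
print(model.params)  # intercept and slope estimates
```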
Machine Learning
- Linear dimension reduction and feature extraction: Introduction; Singular value decomposition and matrix factorization: matrix factorization principles, singular value decomposition (SVD) principles, SVD for variable transformation; Principal components analysis (PCA): principles, dataset preprocessing (centering, standardizing), eigendecomposition of the data covariance matrix, back to SVD, PCA outputs, determining the number of PCs, interpretation and visualization, eigenfaces; Exercises: write a basic PCA class, apply your basic PCA to the Iris dataset, run scikit-learn examples
- Manifold learning, non-linear dimension reduction: Multi-dimensional scaling (MDS): classical multidimensional scaling, example, determining the number of components, exercises; Isomap; t-SNE; Exercises
- Clustering: K-means clustering; Exercises: (1) analyse clusters, (2) re-implement the K-means clustering algorithm (homework); Gaussian mixture models; Model selection: Bayesian information criterion; Hierarchical clustering: Ward clustering; Exercises
- Linear models for regression problems: Ordinary least squares; Linear regression with scikit-learn; Overfitting: model complexity, multicollinearity, high dimensionality; Regularization using penalization of coefficients: ridge regression (ℓ2 regularization), lasso regression (ℓ1 regularization), sparsity of the ℓ1 norm, Occam's razor, principle of parsimony, sparsity-inducing penalty or embedded feature selection with the ℓ1 penalty, optimization issues, elastic-net regression (ℓ1-ℓ2 regularization), rationale; Regression performance evaluation metrics: R-squared, MSE and MAE
- Linear models for classification problems: Fisher's linear discriminant with equal class covariance: the Fisher most discriminant projection, demonstration, the separating hyperplane; Linear discriminant analysis (LDA); Exercise; Logistic regression; Exercise; Losses: negative log-likelihood or cross-entropy, hinge loss or ℓ1 loss; Overfitting; Regularization using penalization of coefficients: ridge Fisher's linear classification (ℓ2), ridge logistic regression (ℓ2), lasso logistic regression (ℓ1), ridge linear Support Vector Machine (ℓ2), lasso linear Support Vector Machine (ℓ1), elastic-net classification (ℓ1-ℓ2); Classification performance evaluation metrics: area under the curve (AUC) of the receiver operating characteristic (ROC), imbalanced classes, confidence interval cross-validation, significance of classification metrics; Exercise: Fisher linear discriminant rule
- Non-linear models: Support Vector Machines (SVM); Random forest: decision tree, forest; Gradient boosting
- Resampling methods: Train, validation and test sets: split the dataset into train/test sets for model evaluation, train/validation/test splits for model selection and model evaluation; Cross-validation (CV): CV for regression, CV for classification (stratify for the target label), cross-validation for model selection, cross-validation for both model (outer) evaluation and model (inner) selection, models with built-in cross-validation; Random permutations: sample the null distribution; Bootstrapping; Parallel computation with joblib
- Ensemble learning, bagging, boosting and stacking: Single weak learner; Bagging; Boosting: (1) adaptive boosting, (2) gradient boosting; Overview of stacking
- Gradient descent: Introduction: learning rate, cost function; Numerical solution for gradient descent; Gradient descent variants: batch gradient descent, stochastic gradient descent, mini-batch gradient descent; Gradient descent challenges; Gradient descent optimization algorithms: momentum, AdaGrad (adaptive learning rates), RMSProp ("leaky AdaGrad"), Nesterov accelerated gradient, Adam
- Lab, faces recognition using various learning models: Utils; Download the data; Split into training and testing sets in a stratified way; Eigenfaces; LogisticRegression with L2 penalty (with CV-based model selection); SVM (with CV-based model selection); MLP with sklearn and CV-based model selection; MLP with pytorch and no model selection; Univariate feature filtering (ANOVA) with Logistic-L2; PCA with LogisticRegression with L2 regularization; Basic ConvNet; ConvNet with ResNet18
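As a taste of the Machine Learning part's scikit-learn workflow, here is a minimal sketch (illustrative only, not excerpted from the book) combining a stratified train/test split, an ℓ2-regularized logistic regression pipeline, and cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Stratified train/test split for model evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Pipeline: standardize, then fit an L2-penalized logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))

# 5-fold cross-validation on the training set.
print("CV accuracy:", cross_val_score(model, X_tr, y_tr, cv=5).mean())
```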
Deep Learning
- Backpropagation: Course outline; Backpropagation and the chain rule; Chain rule; Recap: vector derivatives; Backpropagation summary; Lab with numpy and pytorch: load the Iris data set, backpropagation with numpy, backpropagation with PyTorch tensors, backpropagation with PyTorch tensors and autograd, backpropagation with PyTorch nn, backpropagation with PyTorch optim
- Multilayer Perceptron (MLP): Course outline; Dataset: MNIST handwritten digit recognition; Recall of linear classifiers: binary logistic regression, softmax classifier (multinomial logistic regression); Model: two-layer MLP; MLP with scikit-learn; MLP with pytorch: train the model, continue training from checkpoints (reload the model and run 10 more epochs), test several MLP architectures, reduce the size of the training dataset, run the MLP on the CIFAR-10 dataset
- Convolutional neural network: Outline; Architectures: LeNet, AlexNet, general architecture guidelines; Train function; CNN models: LeNet-5, VGGNet-like (conv-relu blocks), ResNet-like; Model, MNIST digit classification: LeNet, MiniVGGNet, reduce the size of the training dataset; CIFAR-10 dataset: LeNet, MiniVGGNet, ResNet
- Transfer learning tutorial: Training function; CIFAR-10 dataset; Fine-tuning the convnet; ResNet as a feature extractor

Indices and tables
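Finally, in the spirit of the Deep Learning part, a minimal sketch (illustrative only, not excerpted from the book) of a two-layer MLP trained on toy data with PyTorch autograd and optim:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 64 samples, 4 features, 3 classes.
X = torch.randn(64, 4)
y = torch.randint(0, 3, (64,))

# Two-layer MLP, cross-entropy loss, SGD with momentum.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(100):
    optimizer.zero_grad()          # reset gradients
    loss = loss_fn(model(X), y)    # forward pass
    loss.backward()                # backpropagation via autograd
    optimizer.step()               # gradient descent update

print("final training loss:", loss.item())
```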