Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.
Conditions of Use
This book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the ebook Minimalist Data Wrangling with Python for free.
- Title: Minimalist Data Wrangling with Python
- Publisher: Zenodo
- Author(s): Marek Gagolewski
- Published: 2022-08-24
- Edition: 1
- Format: eBook (pdf, epub, mobi)
- Pages: 442
- Language: English
- ISBN-10: 0645571911
- ISBN-13: 9780645571912
- License: CC BY-NC-SA
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Start here
- Preface
  - The art of data wrangling
  - Aims, scope, and design philosophy
  - Structure
  - The Rules
  - About the author
  - Acknowledgements
  - You can make this book better
Introducing Python
- 1. Getting started with Python
  - 1.1. Installing Python
  - 1.2. Working with Jupyter notebooks
  - 1.3. The best note-taking app
  - 1.4. Initialising each session and getting example data
  - 1.5. Exercises
- 2. Scalar types and control structures in Python
  - 2.1. Scalar types
  - 2.2. Calling built-in functions
  - 2.3. Controlling program flow
  - 2.4. Defining functions
  - 2.5. Exercises
- 3. Sequential and other types in Python
  - 3.1. Sequential types
  - 3.2. Working with sequences
  - 3.3. Dictionaries
  - 3.4. Iterable types
  - 3.5. Object references and copying (*)
  - 3.6. Further reading
  - 3.7. Exercises
Unidimensional data
- 4. Unidimensional numeric data and their empirical distribution
  - 4.1. Creating vectors in numpy
  - 4.2. Some mathematical notation
  - 4.3. Inspecting the data distribution with histograms
  - 4.4. Exercises
- 5. Processing unidimensional data
  - 5.1. Aggregating numeric data
  - 5.2. Vectorised mathematical functions
  - 5.3. Arithmetic operators
  - 5.4. Indexing vectors
  - 5.5. Other operations
  - 5.6. Exercises
- 6. Continuous probability distributions
  - 6.1. Normal distribution
  - 6.2. Assessing goodness-of-fit
  - 6.3. Other noteworthy distributions
  - 6.4. Generating pseudorandom numbers
  - 6.5. Further reading
  - 6.6. Exercises
Multidimensional data
- 7. From uni- to multidimensional numeric data
  - 7.1. Creating matrices
  - 7.2. Reshaping matrices
  - 7.3. Mathematical notation
  - 7.4. Visualising multidimensional data
  - 7.5. Exercises
- 8. Processing multidimensional data
  - 8.1. Extending vectorised operations to matrices
  - 8.2. Indexing matrices
  - 8.3. Matrix multiplication, dot products, and Euclidean norm (*)
  - 8.4. Pairwise distances and related methods (*)
  - 8.5. Exercises
- 9. Exploring relationships between variables
  - 9.1. Measuring correlation
  - 9.2. Regression tasks (*)
  - 9.3. Finding interesting combinations of variables (*)
  - 9.4. Further reading
  - 9.5. Exercises
Heterogeneous data
- 10. Introducing data frames
  - 10.1. Creating data frames
  - 10.2. Aggregating data frames
  - 10.3. Transforming data frames
  - 10.4. Indexing Series objects
  - 10.5. Indexing data frames
  - 10.6. Further operations on data frames
  - 10.7. Exercises
- 11. Handling categorical data
  - 11.1. Representing and generating categorical data
  - 11.2. Frequency distributions
  - 11.3. Visualising factors
  - 11.4. Aggregating and comparing factors
  - 11.5. Exercises
- 12. Processing data in groups
  - 12.1. Basic methods
  - 12.2. Plotting data in groups
  - 12.3. Classification tasks (*)
  - 12.4. Clustering tasks (*)
  - 12.5. Further reading
  - 12.6. Exercises
- 13. Accessing databases
  - 13.1. Example database
  - 13.2. Exporting data to a database
  - 13.3. Exercises on SQL vs pandas
  - 13.4. Closing the database connection
  - 13.5. Common data serialisation formats for the Web
  - 13.6. Working with many files
  - 13.7. Further reading
  - 13.8. Exercises
Other data types
- 14. Text data
  - 14.1. Basic string operations
  - 14.2. Working with string lists
  - 14.3. Formatted outputs for reproducible report generation
  - 14.4. Regular expressions (*)
  - 14.5. Exercises
- 15. Missing, censored, and questionable data
  - 15.1. Missing data
  - 15.2. Censored and interval data (*)
  - 15.3. Incorrect data
  - 15.4. Outliers
  - 15.5. Exercises
- 16. Time series
  - 16.1. Temporal ordering and line charts
  - 16.2. Working with date-times and time-deltas
  - 16.3. Basic operations
  - 16.4. Further reading
  - 16.5. Exercises
Appendix
- Changelog
- References