The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.
Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition has been significantly expanded and updated, presenting new topics and updating the coverage of others.
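As a rough illustration of the interaction loop that the book formalizes (a sketch, not code from the book), the agent repeatedly observes a state, chooses an action, and receives a reward from the environment; the `env` object and its `reset()`/`step()` interface here are assumptions made for illustration only.

```python
# Illustrative sketch of the agent-environment loop (not from the book).
# `env` is a hypothetical environment with reset()/step(); the agent acts
# at random here, whereas a learning agent would improve its policy.
import random

def run_episode(env, actions):
    """Interact with `env` for one episode and return the total reward received."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = random.choice(actions)           # a real agent would choose based on `state`
        state, reward, done = env.step(action)    # next state, reward, end-of-episode flag
        total_reward += reward                    # the quantity the agent tries to maximize
    return total_reward
```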
Like the first edition, this second edition focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on such topics as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III has new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as an updated case-studies chapter including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.
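For a flavor of the tabular methods covered in Part I, below is a minimal sketch of Q-learning with ε-greedy action selection, one of the core online algorithms the book treats. The code is illustrative only: the environment interface and the hyperparameter values are assumptions, not anything specified by the book.

```python
# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# The Gym-style env interface and hyperparameters are assumptions for illustration.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                         # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:          # explore occasionally
                action = random.choice(actions)
            else:                                  # otherwise act greedily on current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # TD update toward the reward plus the discounted value of the best next action
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Part II of the book replaces the table Q with a parameterized function approximator, such as a linear model or an artificial neural network.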
Conditions of Use
This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. You can download the ebook Reinforcement Learning, 2nd Edition for free.
- Title
- Reinforcement Learning, 2nd Edition
- Subtitle
- An Introduction
- Publisher
- The MIT Press
- Author(s)
- Richard S. Sutton, Andrew G. Barto
- Published
- 2018-11-23
- Edition
- 2
- Format
- eBook (pdf, epub, mobi)
- Pages
- 548
- Language
- English
- ISBN-10
- 0262039249
- ISBN-13
- 9780262039246
- License
- CC BY-NC-SA
- Book Homepage
- Free eBook, Errata, Code, Solutions, etc.
Table of Contents
- Preface to the Second Edition
- Preface to the First Edition
- Summary of Notation
- Introduction: Reinforcement Learning; Examples; Elements of Reinforcement Learning; Limitations and Scope; An Extended Example: Tic-Tac-Toe; Summary; Early History of Reinforcement Learning
- Part I: Tabular Solution Methods
- Multi-armed Bandits: A k-armed Bandit Problem; Action-value Methods; The 10-armed Testbed; Incremental Implementation; Tracking a Nonstationary Problem; Optimistic Initial Values; Upper-Confidence-Bound Action Selection; Gradient Bandit Algorithms; Associative Search (Contextual Bandits); Summary
- Finite Markov Decision Processes: The Agent–Environment Interface; Goals and Rewards; Returns and Episodes; Unified Notation for Episodic and Continuing Tasks; Policies and Value Functions; Optimal Policies and Optimal Value Functions; Optimality and Approximation; Summary
- Dynamic Programming: Policy Evaluation (Prediction); Policy Improvement; Policy Iteration; Value Iteration; Asynchronous Dynamic Programming; Generalized Policy Iteration; Efficiency of Dynamic Programming; Summary
- Monte Carlo Methods: Monte Carlo Prediction; Monte Carlo Estimation of Action Values; Monte Carlo Control; Monte Carlo Control without Exploring Starts; Off-policy Prediction via Importance Sampling; Incremental Implementation; Off-policy Monte Carlo Control; *Discounting-aware Importance Sampling; *Per-decision Importance Sampling; Summary
- Temporal-Difference Learning: TD Prediction; Advantages of TD Prediction Methods; Optimality of TD(0); Sarsa: On-policy TD Control; Q-learning: Off-policy TD Control; Expected Sarsa; Maximization Bias and Double Learning; Games, Afterstates, and Other Special Cases; Summary
- n-step Bootstrapping: n-step TD Prediction; n-step Sarsa; n-step Off-policy Learning; *Per-decision Methods with Control Variates; Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm; *A Unifying Algorithm: n-step Q(σ); Summary
- Planning and Learning with Tabular Methods: Models and Planning; Dyna: Integrated Planning, Acting, and Learning; When the Model Is Wrong; Prioritized Sweeping; Expected vs. Sample Updates; Trajectory Sampling; Real-time Dynamic Programming; Planning at Decision Time; Heuristic Search; Rollout Algorithms; Monte Carlo Tree Search; Summary of the Chapter; Summary of Part I: Dimensions
- Part II: Approximate Solution Methods
- On-policy Prediction with Approximation: Value-function Approximation; The Prediction Objective (VE); Stochastic-gradient and Semi-gradient Methods; Linear Methods; Feature Construction for Linear Methods; Polynomials; Fourier Basis; Coarse Coding; Tile Coding; Radial Basis Functions; Selecting Step-Size Parameters Manually; Nonlinear Function Approximation: Artificial Neural Networks; Least-Squares TD; Memory-based Function Approximation; Kernel-based Function Approximation; Looking Deeper at On-policy Learning: Interest and Emphasis; Summary
- On-policy Control with Approximation: Episodic Semi-gradient Control; Semi-gradient n-step Sarsa; Average Reward: A New Problem Setting for Continuing Tasks; Deprecating the Discounted Setting; Differential Semi-gradient n-step Sarsa; Summary
- *Off-policy Methods with Approximation: Semi-gradient Methods; Examples of Off-policy Divergence; The Deadly Triad; Linear Value-function Geometry; Gradient Descent in the Bellman Error; The Bellman Error is Not Learnable; Gradient-TD Methods; Emphatic-TD Methods; Reducing Variance; Summary
- Eligibility Traces: The λ-return; TD(λ); n-step Truncated λ-return Methods; Redoing Updates: Online λ-return Algorithm; True Online TD(λ); *Dutch Traces in Monte Carlo Learning; Sarsa(λ); Variable λ and γ; *Off-policy Traces with Control Variates; Watkins's Q(λ) to Tree-Backup(λ); Stable Off-policy Methods with Traces; Implementation Issues; Conclusions
- Policy Gradient Methods: Policy Approximation and its Advantages; The Policy Gradient Theorem; REINFORCE: Monte Carlo Policy Gradient; REINFORCE with Baseline; Actor–Critic Methods; Policy Gradient for Continuing Problems; Policy Parameterization for Continuous Actions; Summary
- Part III: Looking Deeper
- Psychology: Prediction and Control; Classical Conditioning; Blocking and Higher-order Conditioning; The Rescorla–Wagner Model; The TD Model; TD Model Simulations; Instrumental Conditioning; Delayed Reinforcement; Cognitive Maps; Habitual and Goal-directed Behavior; Summary
- Neuroscience: Neuroscience Basics; Reward Signals, Reinforcement Signals, Values, and Prediction Errors; The Reward Prediction Error Hypothesis; Dopamine; Experimental Support for the Reward Prediction Error Hypothesis; TD Error/Dopamine Correspondence; Neural Actor–Critic; Actor and Critic Learning Rules; Hedonistic Neurons; Collective Reinforcement Learning; Model-based Methods in the Brain; Addiction; Summary
- Applications and Case Studies: TD-Gammon; Samuel's Checkers Player; Watson's Daily-Double Wagering; Optimizing Memory Control; Human-level Video Game Play; Mastering the Game of Go; AlphaGo; AlphaGo Zero; Personalized Web Services; Thermal Soaring
- Frontiers: General Value Functions and Auxiliary Tasks; Temporal Abstraction via Options; Observations and State; Designing Reward Signals; Remaining Issues; The Future of Artificial Intelligence
- References
- Index