The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.
Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition has been significantly expanded and updated, presenting new topics and updating the coverage of others.
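As a rough illustration of the interaction loop that the book formalizes (a sketch, not code from the book), the agent repeatedly observes a state, chooses an action, and receives a reward from the environment; the `env` object and its `reset()`/`step()` interface here are assumptions made for illustration only.

```python
# Illustrative sketch of the agent-environment loop (not from the book).
# `env` is a hypothetical environment with reset()/step(); the agent acts
# at random here, whereas a learning agent would improve its policy.
import random

def run_episode(env, actions):
    """Interact with `env` for one episode and return the total reward received."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = random.choice(actions)           # a real agent would choose based on `state`
        state, reward, done = env.step(action)    # next state, reward, end-of-episode flag
        total_reward += reward                    # the quantity the agent tries to maximize
    return total_reward
```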
Like the first edition, this second edition focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on such topics as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III has new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as an updated case-studies chapter including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.
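For a flavor of the tabular methods covered in Part I, below is a minimal sketch of Q-learning with ε-greedy action selection, one of the core online algorithms the book treats. The code is illustrative only: the environment interface and the hyperparameter values are assumptions, not anything specified by the book.

```python
# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# The Gym-style env interface and hyperparameters are assumptions for illustration.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                         # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:          # explore occasionally
                action = random.choice(actions)
            else:                                  # otherwise act greedily on current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # TD update toward the reward plus the discounted value of the best next action
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Part II of the book replaces the table Q with a parameterized function approximator, such as a linear model or an artificial neural network.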
Conditions of Use
This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. You can download the ebook Reinforcement Learning, 2nd Edition for free.
- Title
- Reinforcement Learning, 2nd Edition
- Subtitle
- An Introduction
- Publisher
- The MIT Press
- Author(s)
- Richard S. Sutton, Andrew G. Barto
- Published
- 2018-11-23
- Edition
- 2
- Format
- eBook (pdf, epub, mobi)
- Pages
- 548
- Language
- English
- ISBN-10
- 0262039249
- ISBN-13
- 9780262039246
- License
- CC BY-NC-SA
- Book Homepage
- Free eBook, Errata, Code, Solutions, etc.
Table of Contents
- Preface to the Second Edition
- Preface to the First Edition
- Summary of Notation
- Introduction: Reinforcement Learning; Examples; Elements of Reinforcement Learning; Limitations and Scope; An Extended Example: Tic-Tac-Toe; Summary; Early History of Reinforcement Learning
- Part I: Tabular Solution Methods
- Multi-armed Bandits: A k-armed Bandit Problem; Action-value Methods; The 10-armed Testbed; Incremental Implementation; Tracking a Nonstationary Problem; Optimistic Initial Values; Upper-Confidence-Bound Action Selection; Gradient Bandit Algorithms; Associative Search (Contextual Bandits); Summary
- Finite Markov Decision Processes: The Agent–Environment Interface; Goals and Rewards; Returns and Episodes; Unified Notation for Episodic and Continuing Tasks; Policies and Value Functions; Optimal Policies and Optimal Value Functions; Optimality and Approximation; Summary
- Dynamic Programming: Policy Evaluation (Prediction); Policy Improvement; Policy Iteration; Value Iteration; Asynchronous Dynamic Programming; Generalized Policy Iteration; Efficiency of Dynamic Programming; Summary
- Monte Carlo Methods: Monte Carlo Prediction; Monte Carlo Estimation of Action Values; Monte Carlo Control; Monte Carlo Control without Exploring Starts; Off-policy Prediction via Importance Sampling; Incremental Implementation; Off-policy Monte Carlo Control; *Discounting-aware Importance Sampling; *Per-decision Importance Sampling; Summary
- Temporal-Difference Learning: TD Prediction; Advantages of TD Prediction Methods; Optimality of TD(0); Sarsa: On-policy TD Control; Q-learning: Off-policy TD Control; Expected Sarsa; Maximization Bias and Double Learning; Games, Afterstates, and Other Special Cases; Summary
- n-step Bootstrapping: n-step TD Prediction; n-step Sarsa; n-step Off-policy Learning; *Per-decision Methods with Control Variates; Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm; *A Unifying Algorithm: n-step Q(σ); Summary
- Planning and Learning with Tabular Methods: Models and Planning; Dyna: Integrated Planning, Acting, and Learning; When the Model Is Wrong; Prioritized Sweeping; Expected vs. Sample Updates; Trajectory Sampling; Real-time Dynamic Programming; Planning at Decision Time; Heuristic Search; Rollout Algorithms; Monte Carlo Tree Search; Summary of the Chapter; Summary of Part I: Dimensions
- Part II: Approximate Solution Methods
- On-policy Prediction with Approximation: Value-function Approximation; The Prediction Objective (VE); Stochastic-gradient and Semi-gradient Methods; Linear Methods; Feature Construction for Linear Methods; Polynomials; Fourier Basis; Coarse Coding; Tile Coding; Radial Basis Functions; Selecting Step-Size Parameters Manually; Nonlinear Function Approximation: Artificial Neural Networks; Least-Squares TD; Memory-based Function Approximation; Kernel-based Function Approximation; Looking Deeper at On-policy Learning: Interest and Emphasis; Summary
- On-policy Control with Approximation: Episodic Semi-gradient Control; Semi-gradient n-step Sarsa; Average Reward: A New Problem Setting for Continuing Tasks; Deprecating the Discounted Setting; Differential Semi-gradient n-step Sarsa; Summary
- *Off-policy Methods with Approximation: Semi-gradient Methods; Examples of Off-policy Divergence; The Deadly Triad; Linear Value-function Geometry; Gradient Descent in the Bellman Error; The Bellman Error is Not Learnable; Gradient-TD Methods; Emphatic-TD Methods; Reducing Variance; Summary
- Eligibility Traces: The λ-return; TD(λ); n-step Truncated λ-return Methods; Redoing Updates: Online λ-return Algorithm; True Online TD(λ); *Dutch Traces in Monte Carlo Learning; Sarsa(λ); Variable λ and γ; *Off-policy Traces with Control Variates; Watkins's Q(λ) to Tree-Backup(λ); Stable Off-policy Methods with Traces; Implementation Issues; Conclusions
- Policy Gradient Methods: Policy Approximation and its Advantages; The Policy Gradient Theorem; REINFORCE: Monte Carlo Policy Gradient; REINFORCE with Baseline; Actor–Critic Methods; Policy Gradient for Continuing Problems; Policy Parameterization for Continuous Actions; Summary
- Part III: Looking Deeper
- Psychology: Prediction and Control; Classical Conditioning; Blocking and Higher-order Conditioning; The Rescorla–Wagner Model; The TD Model; TD Model Simulations; Instrumental Conditioning; Delayed Reinforcement; Cognitive Maps; Habitual and Goal-directed Behavior; Summary
- Neuroscience: Neuroscience Basics; Reward Signals, Reinforcement Signals, Values, and Prediction Errors; The Reward Prediction Error Hypothesis; Dopamine; Experimental Support for the Reward Prediction Error Hypothesis; TD Error/Dopamine Correspondence; Neural Actor–Critic; Actor and Critic Learning Rules; Hedonistic Neurons; Collective Reinforcement Learning; Model-based Methods in the Brain; Addiction; Summary
- Applications and Case Studies: TD-Gammon; Samuel's Checkers Player; Watson's Daily-Double Wagering; Optimizing Memory Control; Human-level Video Game Play; Mastering the Game of Go; AlphaGo; AlphaGo Zero; Personalized Web Services; Thermal Soaring
- Frontiers: General Value Functions and Auxiliary Tasks; Temporal Abstraction via Options; Observations and State; Designing Reward Signals; Remaining Issues; The Future of Artificial Intelligence
- References
- Index