This open access book is a modern guide for all C++ programmers to learn Threading Building Blocks (TBB). Written by TBB and parallel programming experts, this book reflects their collective decades of experience in developing and teaching parallel programming with TBB, offering their insights in an approachable manner. Throughout the book the authors present numerous examples and best practices to help you become an effective TBB programmer and leverage the power of parallel systems.
Pro TBB starts with the basics, explaining parallel algorithms and the parallel algorithms now built into the C++ Standard Template Library. You'll learn the key concepts of managing memory, working with data structures, and handling typical synchronization issues. Later chapters apply these ideas to complex systems and explain performance tradeoffs, mapping common parallel patterns, controlling threads and overhead, and extending TBB to program heterogeneous systems and systems-on-chip.
What You'll Learn
- Use Threading Building Blocks to produce code that is portable, simple, scalable, and more understandable
- Review best practices for parallelizing computationally intensive tasks in your applications
- Integrate TBB with other threading packages
- Create scalable, high-performance data-parallel programs
- Work with generic programming to write efficient algorithms
Who This Book Is For
C++ programmers learning to run applications on multicore systems, as well as C or C++ programmers without much experience with templates. No previous experience with parallel programming or multicore processors is required.
Conditions of Use
This book is licensed under a Creative Commons license (CC BY-NC-ND), and the ebook edition of Pro TBB can be downloaded for free.
- Title: Pro TBB
- Subtitle: C++ Parallel Programming with Threading Building Blocks
- Publisher: Apress
- Author(s): James Reinders, Michael Voss, Rafael Asenjo
- Published: 2019-07-10
- Edition: 1
- Format: eBook (pdf, epub, mobi)
- Pages: 820
- Language: English
- ISBN-10: 1484243978
- ISBN-13: 9781484243985
- License: CC BY-NC-ND
- Book Homepage: Free eBook, Errata, Code, Solutions, etc.
Table of Contents
- About the Authors
- Acknowledgments
- Preface
- Part
- Chapter 1: Jumping Right In: “Hello, TBB!”
  - Why Threading Building Blocks?
  - Performance: Small Overhead, Big Benefits for C++
  - Evolving Support for Parallelism in TBB and C++
  - Recent C++ Additions for Parallelism
  - The Threading Building Blocks (TBB) Library
  - Parallel Execution Interfaces
  - Interfaces That Are Independent of the Execution Model
  - Using the Building Blocks in TBB
  - Let’s Get Started Already!
  - Getting the Threading Building Blocks (TBB) Library
  - Getting a Copy of the Examples
  - Writing a First “Hello, TBB!” Example
  - Building the Simple Examples
  - Steps to Set Up an Environment
  - Building on Windows Using Microsoft Visual Studio
  - Building on a Linux Platform from a Terminal
  - Using the Intel Compiler
  - tbbvars and pstlvars Scripts
  - Setting Up Variables Manually Without Using the tbbvars Script or the Intel Compiler
  - A More Complete Example
  - Starting with a Serial Implementation
  - Adding a Message-Driven Layer Using a Flow Graph
  - Adding a Fork-Join Layer Using a parallel_for
  - Adding a SIMD Layer Using a Parallel STL Transform
  - Summary
- Chapter 2: Generic Parallel Algorithms
  - Functional / Task Parallelism
  - A Slightly More Complicated Example: A Parallel Implementation of Quicksort
  - Loops: parallel_for, parallel_reduce, and parallel_scan
  - parallel_for: Applying a Body to Each Element in a Range
  - A Slightly More Complicated Example: Parallel Matrix Multiplication
  - parallel_reduce: Calculating a Single Result Across a Range
  - A Slightly More Complicated Example: Calculating π by Numerical Integration
  - parallel_scan: A Reduction with Intermediate Values
  - How Does This Work?
  - A Slightly More Complicated Example: Line of Sight
  - Cook Until Done: parallel_do and parallel_pipeline
  - parallel_do: Apply a Body Until There Are No More Items Left
  - A Slightly More Complicated Example: Forward Substitution
  - parallel_pipeline: Streaming Items Through a Series of Filters
  - A Slightly More Complicated Example: Creating 3D Stereoscopic Images
  - Summary
  - For More Information
- Chapter 3: Flow Graphs
  - Why Use Graphs to Express Parallelism?
  - The Basics of the TBB Flow Graph Interface
  - Step 1: Create the Graph Object
  - Step 2: Make the Nodes
  - Step 3: Add Edges
  - Step 4: Start the Graph
  - Step 5: Wait for the Graph to Complete Executing
  - A More Complicated Example of a Data Flow Graph
  - Implementing the Example as a TBB Flow Graph
  - Understanding the Performance of a Data Flow Graph
  - The Special Case of Dependency Graphs
  - Implementing a Dependency Graph
  - Estimating the Scalability of a Dependency Graph
  - Advanced Topics in TBB Flow Graphs
  - Summary
- Chapter 4: TBB and the Parallel Algorithms of the C++ Standard Template Library
  - Does the C++ STL Library Belong in This Book?
  - A Parallel STL Execution Policy Analogy
  - A Simple Example Using std::for_each
  - What Algorithms Are Provided in a Parallel STL Implementation?
  - How to Get and Use a Copy of Parallel STL That Uses TBB
  - Algorithms in Intel’s Parallel STL
  - Capturing More Use Cases with Custom Iterators
  - Highlighting Some of the Most Useful Algorithms
  - std::for_each, std::for_each_n
  - std::transform
  - std::reduce
  - std::transform_reduce
  - A Deeper Dive into the Execution Policies
  - The sequenced_policy
  - The parallel_policy
  - The unsequenced_policy
  - The parallel_unsequenced_policy
  - Which Execution Policy Should We Use?
  - Other Ways to Introduce SIMD Parallelism
  - Summary
  - For More Information
- Chapter 5: Synchronization: Why and How to Avoid It
  - A Running Example: Histogram of an Image
  - An Unsafe Parallel Implementation
  - A First Safe Parallel Implementation: Coarse-Grained Locking
  - Mutex Flavors
  - A Second Safe Parallel Implementation: Fine-Grained Locking
  - A Third Safe Parallel Implementation: Atomics
  - A Better Parallel Implementation: Privatization and Reduction
  - Thread Local Storage, TLS
  - enumerable_thread_specific, ETS
  - combinable
  - The Easiest Parallel Implementation: Reduction Template
  - Recap of Our Options
  - Summary
  - For More Information
- Chapter 6: Data Structures for Concurrency
  - Key Data Structures Basics
  - Unordered Associative Containers
  - Map vs. Set
  - Multiple Values
  - Hashing
  - Unordered Concurrent Containers
  - Concurrent Unordered Associative Containers
  - concurrent_hash_map
  - Concurrent Support for map/multimap and set/multiset Interfaces
  - Built-In Locking vs. No Visible Locking
  - Iterating Through These Structures Is Asking for Trouble
  - Concurrent Queues: Regular, Bounded, and Priority
  - Bounding Size
  - Priority Ordering
  - Staying Thread-Safe: Try to Forget About Top, Size, Empty, Front, Back
  - Iterators
  - Why to Use This Concurrent Queue: The A-B-A Problem
  - When to NOT Use Queues: Think Algorithms!
  - Concurrent Vector
  - When to Use tbb::concurrent_vector Instead of std::vector
  - Elements Never Move
  - Concurrent Growth of concurrent_vectors
  - Summary
- Chapter 7: Scalable Memory Allocation
  - Modern C++ Memory Allocation
  - Scalable Memory Allocation: What
  - Scalable Memory Allocation: Why
  - Avoiding False Sharing with Padding
  - Scalable Memory Allocation Alternatives: Which
  - Compilation Considerations
  - Most Popular Usage (C/C++ Proxy Library): How
  - Linux: malloc/new Proxy Library Usage
  - macOS: malloc/new Proxy Library Usage
  - Windows: malloc/new Proxy Library Usage
  - Testing Our Proxy Library Usage
  - C Functions: Scalable Memory Allocators for C
  - C++ Classes: Scalable Memory Allocators for C++
  - Allocators with std::allocator Signature
  - scalable_allocator
  - tbb_allocator
  - zero_allocator
  - cache_aligned_allocator
  - Memory Pool Support: memory_pool_allocator
  - Array Allocation Support: aligned_space
  - Replacing new and delete Selectively
  - Performance Tuning: Some Control Knobs
  - What Are Huge Pages?
  - TBB Support for Huge Pages
  - scalable_allocation_mode(int mode, intptr_t value)
  - TBBMALLOC_USE_HUGE_PAGES
  - TBBMALLOC_SET_SOFT_HEAP_LIMIT
  - int scalable_allocation_command(int cmd, void *param)
  - TBBMALLOC_CLEAN_ALL_BUFFERS
  - TBBMALLOC_CLEAN_THREAD_BUFFERS
  - Summary
- Chapter 8: Mapping Parallel Patterns to TBB
  - Parallel Patterns vs. Parallel Algorithms
  - Patterns Categorize Algorithms, Designs, etc.
  - Patterns That Work
  - Data Parallelism Wins
  - Nesting Pattern
  - Map Pattern
  - Workpile Pattern
  - Reduction Patterns (Reduce and Scan)
  - Fork-Join Pattern
  - Divide-and-Conquer Pattern
  - Branch-and-Bound Pattern
  - Pipeline Pattern
  - Event-Based Coordination Pattern (Reactive Streams)
  - Summary
  - For More Information
- Part
- Chapter 9: The Pillars of Composability
  - What Is Composability?
  - Nested Composition
  - Concurrent Composition
  - Serial Composition
  - The Features That Make TBB a Composable Library
  - The TBB Thread Pool (the Market) and Task Arenas
  - The TBB Task Dispatcher: Work Stealing and More
  - Putting It All Together
  - Looking Forward
  - Controlling the Number of Threads
  - Work Isolation
  - Task-to-Thread and Thread-to-Core Affinity
  - Task Priorities
  - Summary
  - For More Information
- Chapter 10: Using Tasks to Create Your Own Algorithms
  - A Running Example: The Sequence
  - The High-Level Approach: parallel_invoke
  - The Highest Among the Lower: task_group
  - The Low-Level Task Interface: Part One – Task Blocking
  - The Low-Level Task Interface: Part Two – Task Continuation
  - Bypassing the Scheduler
  - The Low-Level Task Interface: Part Three – Task Recycling
  - Task Interface Checklist
  - One More Thing: FIFO (aka Fire-and-Forget) Tasks
  - Putting These Low-Level Features to Work
  - Summary
  - For More Information
- Chapter 11: Controlling the Number of Threads Used for Execution
  - A Brief Recap of the TBB Scheduler Architecture
  - Interfaces for Controlling the Number of Threads
  - Controlling Thread Count with task_scheduler_init
  - Controlling Thread Count with task_arena
  - Controlling Thread Count with global_control
  - Summary of Concepts and Classes
  - The Best Approaches for Setting the Number of Threads
  - Using a Single task_scheduler_init Object for a Simple Application
  - Using More Than One task_scheduler_init Object in a Simple Application
  - Using Multiple Arenas with Different Numbers of Slots to Influence Where TBB Places Its Worker Threads
  - Using global_control to Control How Many Threads Are Available to Fill Arena Slots
  - Using global_control to Temporarily Restrict the Number of Available Threads
  - When NOT to Control the Number of Threads
  - Figuring Out What’s Gone Wrong
  - Summary
- Chapter 12: Using Work Isolation for Correctness and Performance
  - Work Isolation for Correctness
  - Creating an Isolated Region with this_task_arena::isolate
  - Oh No! Work Isolation Can Cause Its Own Correctness Issues!
  - Even When It Is Safe, Work Isolation Is Not Free
  - Using Task Arenas for Isolation: A Double-Edged Sword
  - Don’t Be Tempted to Use task_arenas to Create Work Isolation for Correctness
  - Summary
  - For More Information
- Chapter 13: Creating Thread-to-Core and Task-to-Thread Affinity
  - Creating Thread-to-Core Affinity
  - Creating Task-to-Thread Affinity
  - When and How Should We Use the TBB Affinity Features?
  - Summary
  - For More Information
- Chapter 14: Using Task Priorities
  - Support for Non-Preemptive Priorities in the TBB Task Class
  - Setting Static and Dynamic Priorities
  - Two Small Examples
  - Implementing Priorities Without Using TBB Task Support
  - Summary
  - For More Information
- Chapter 15: Cancellation and Exception Handling
  - How to Cancel Collective Work
  - Advanced Task Cancellation
  - Explicit Assignment of TGC
  - Default Assignment of TGC
  - Exception Handling in TBB
  - Tailoring Our Own TBB Exceptions
  - Putting All Together: Composability, Cancellation, and Exception Handling
  - Summary
  - For More Information
- Chapter 16: Tuning TBB Algorithms: Granularity, Locality, Parallelism, and Determinism
  - Task Granularity: How Big Is Big Enough?
  - Choosing Ranges and Partitioners for Loops
  - An Overview of Partitioners
  - Choosing a Grainsize (or Not) to Manage Task Granularity
  - Ranges, Partitioners, and Data Cache Performance
  - Cache-Oblivious Algorithms
  - Cache Affinity
  - Using a static_partitioner
  - Restricting the Scheduler for Determinism
  - Tuning TBB Pipelines: Number of Filters, Modes, and Tokens
  - Understanding a Balanced Pipeline
  - Understanding an Imbalanced Pipeline
  - Pipelines and Data Locality and Thread Affinity
  - Deep in the Weeds
  - Making Your Own Range Type
  - The Pipeline Class and Thread-Bound Filters
  - Summary
  - For More Information
- Chapter 17: Flow Graphs: Beyond the Basics
  - Optimizing for Granularity, Locality, and Parallelism
  - Node Granularity: How Big Is Big Enough?
  - What to Do If Nodes Are Too Small
  - Memory Usage and Data Locality
  - Data Locality in Flow Graphs
  - Picking the Best Message Type and Limiting the Number of Messages in Flight
  - Task Arenas and Flow Graph
  - The Default Arena Used by a Flow Graph
  - Changing the Task Arena Used by a Flow Graph
  - Setting the Number of Threads, Thread-to-Core Affinities, etc.
  - Key FG Advice: Dos and Don’ts
  - Do: Use Nested Parallelism
  - Don’t: Use Multifunction Nodes in Place of Nested Parallelism
  - Do: Use join_node, sequencer_node, or multifunction_node to Reestablish Order in a Flow Graph When Needed
  - Do: Use the Isolate Function for Nested Parallelism
  - Do: Use Cancellation and Exception Handling in Flow Graphs
  - Each Flow Graph Uses a Single task_group_context
  - Canceling a Flow Graph
  - Resetting a Flow Graph After Cancellation
  - Exception Handling Examples
  - Do: Set a Priority for a Graph Using task_group_context
  - Don’t: Make an Edge Between Nodes in Different Graphs
  - Do: Use try_put to Communicate Across Graphs
  - Do: Use composite_node to Encapsulate Groups of Nodes
  - Introducing Intel Advisor: Flow Graph Analyzer
  - The FGA Design Workflow
  - Tips for Iterative Development with FGA
  - The FGA Analysis Workflow
  - Diagnosing Performance Issues with FGA
  - Diagnosing Granularity Issues with FGA
  - Recognizing Slow Copies in FGA
  - Diagnosing Moonlighting Using FGA
  - Summary
  - For More Information
- Chapter 18: Beef Up Flow Graphs with Async Nodes
  - Async World Example
  - Why and When async_node?
  - A More Realistic Example
  - Summary
  - For More Information
- Chapter 19: Flow Graphs on Steroids: OpenCL Nodes
  - Hello OpenCL_Node Example
  - Where Are We Running Our Kernel?
  - Back to the More Realistic Example of Chapter
  - The Devil Is in the Details
  - The NDRange Concept
  - Playing with the Offset
  - Specifying the OpenCL Kernel
  - Even More on Device Selection
  - A Warning Regarding the Order Is in Order!
  - Summary
  - For More Information
- Chapter 20: TBB on NUMA Architectures
  - Discovering Your Platform Topology
  - Understanding the Costs of Accessing Memory
  - Our Baseline Example
  - Mastering Data Placement and Processor Affinity
  - Putting hwloc and TBB to Work Together
  - More Advanced Alternatives
  - Summary
  - For More Information
- Appendix A: History and Inspiration
  - A Decade of “Hatchling to Soaring”
  - 1. TBB’s Revolution Inside Intel
  - 2. TBB’s First Revolution of Parallelism
  - 3. TBB’s Second Revolution of Parallelism
  - 4. TBB’s Birds
  - Inspiration for TBB
  - Relaxed Sequential Execution Model
  - Influential Libraries
  - Influential Languages
  - Influential Pragmas
  - Influences of Generic Programming
  - Considering Caches
  - Considering Costs of Time Slicing
  - Further Reading
- Appendix B: TBB Précis
  - Debug and Conditional Coding
  - Preview Feature Macros
  - Ranges
  - Partitioners
  - Algorithms
  - Algorithm: parallel_do
  - Algorithm: parallel_for
  - Algorithm: parallel_for_each
  - Algorithm: parallel_invoke
  - Algorithm: parallel_pipeline
  - Algorithm: parallel_reduce and parallel_deterministic_reduce
  - Algorithm: parallel_scan
  - Algorithm: parallel_sort
  - Algorithm: pipeline
  - Flow Graph
  - Flow Graph: graph class
  - Flow Graph: ports and edges
  - Flow Graph: nodes
  - tbb::flow::tuple vs. std::tuple
  - Graph Policy (namespace)
  - Memory Allocation
  - Containers
  - Synchronization
  - Thread Local Storage (TLS)
  - Timing
  - Task Groups: Use of the Task Stealing Scheduler
  - Task Scheduler: Fine Control of the Task Stealing Scheduler
  - Floating-Point Settings
  - Exceptions
  - Threads
  - Parallel STL
- Glossary
- Index