What is shuffle in Apache Spark, and why is it an expensive operation?

  • A data re-distribution process during transformations
  • A process of joining two datasets
  • A process of re-partitioning data for parallel processing
  • A task scheduling mechanism in Spark
Shuffle in Apache Spark re-distributes data across partitions. It is triggered by wide transformations such as groupByKey, reduceByKey, sortByKey, or join, where rows with the same key must end up in the same partition. It is expensive because it involves serializing data, writing intermediate files to disk, and moving data across the network between executors, which also forces a stage boundary in the job.
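As a minimal sketch of how a shuffle shows up in practice, the snippet below (assuming a local Spark setup; the dataset and names are hypothetical) groups a small DataFrame by key. The physical plan printed by `explain()` contains an `Exchange` node, which is the shuffle stage:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShuffleDemo")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // A small dataset spread across several partitions.
    val sales = Seq(("us", 10), ("eu", 20), ("us", 5), ("eu", 15))
      .toDF("region", "amount")
      .repartition(4)

    // groupBy is a wide transformation: rows with the same key must be
    // moved to the same partition, which triggers a shuffle.
    val totals = sales.groupBy("region").sum("amount")

    // The physical plan shows an Exchange (shuffle) stage.
    totals.explain()
    totals.show()

    spark.stop()
  }
}
```

Minimizing such shuffles, for example by reducing data before grouping or by reusing an existing partitioning, is a common Spark tuning technique.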