Understanding Spark's internal architecture is crucial for writing efficient, high-performing applications.
Core Components:
- Driver Program: The main program that runs on the master node of a cluster. It contains the main() function, creates the SparkContext, and orchestrates the execution of operations on the cluster.
- SparkContext: The entry point to Spark functionality. It connects to the cluster manager and can create RDDs, DataFrames, and Datasets.
- Cluster Manager: An external service (e.g., YARN, Mesos, Kubernetes, Standalone) that acquires resources on the cluster and allocates them to Spark applications.
- Executors: Worker processes that run on the worker nodes. They perform the actual data processing tasks (computations) and store data in memory or on disk.
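The driver/executor split above can be sketched in pure Python (this is an analogy, not Spark code): a "driver" partitions the data, dispatches one task per partition to a pool of "executors" (worker threads stand in for Spark's executor processes), and aggregates the partial results.

```python
# Pure-Python analogy of the driver/executor architecture (not Spark code).
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # The work one executor performs on one partition of the data.
    return sum(x * x for x in partition)

def driver(data, num_partitions=4):
    # The driver partitions the data and schedules one task per partition.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial_results = list(executors.map(task, partitions))
    # The driver combines the per-partition results into the final answer.
    return sum(partial_results)

print(driver(list(range(10))))  # 285 (sum of squares of 0..9)
```

In real Spark the executors are separate JVM processes on worker nodes and results flow back to the driver over the network, but the division of labor is the same.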
Data Abstractions:
- RDD (Resilient Distributed Dataset): The fundamental data structure of Spark. An immutable, distributed collection of objects that can be operated on in parallel. Low-level API, less optimized.
- DataFrame: A distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database. Provides schema and allows for Catalyst Optimizer optimizations. Available in Scala, Java, Python, R.
- Dataset: Combines the benefits of RDDs (strong typing, compile-time safety) and DataFrames (Catalyst Optimizer, Tungsten optimizations). Available in Scala and Java.
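Why DataFrames optimize better than RDDs can be illustrated with a pure-Python analogy (not Spark code): an RDD is an opaque collection of objects processed by user functions Spark cannot inspect, while a DataFrame carries named columns, so the engine can see which columns a query needs and skip the rest (column pruning).

```python
# Pure-Python analogy (not Spark code): opaque rows vs. named columns.
rdd_like = [("alice", 34, "NY"), ("bob", 45, "SF")]  # field meaning is opaque

df_like = {              # columnar layout with a schema of named columns
    "name": ["alice", "bob"],
    "age":  [34, 45],
    "city": ["NY", "SF"],
}

# RDD style: a user-supplied function; the engine cannot see inside it.
ages_rdd = [row[1] for row in rdd_like]

# DataFrame style: the query names a column, so only "age" is touched.
ages_df = df_like["age"]

assert ages_rdd == ages_df == [34, 45]
```

Because the DataFrame query is declarative ("give me the age column"), the Catalyst Optimizer can rewrite it; the RDD lambda is a black box it must run as-is.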
Execution Flow:
- DAG (Directed Acyclic Graph) Scheduler: Spark builds a DAG of operations based on transformations (e.g., map, filter) and actions (e.g., collect, count).
- Stages: The DAG is broken down into stages. A stage consists of a set of tasks that can be run in parallel without a shuffle; a shuffle operation typically marks the boundary between stages.
- Tasks: The smallest unit of execution in Spark. Each task processes a partition of data. Tasks are sent to executors for execution.
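The stage/task breakdown above can be sketched in a few lines of pure Python (a simplification, not Spark's actual scheduler: in real Spark the shuffle write ends one stage and the shuffle read begins the next): a pipeline of operations is cut into stages at each shuffle, and each stage runs one task per data partition.

```python
# Pure-Python sketch (not Spark internals): splitting a pipeline of
# operations into stages at shuffle boundaries, as the DAG scheduler does.
NARROW = {"map", "filter"}          # no data movement between partitions
SHUFFLE = {"reduceByKey", "join"}   # require a shuffle -> stage boundary

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in SHUFFLE:           # a shuffle closes the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
print(stages)  # [['map', 'filter', 'reduceByKey'], ['map', 'join'], ['filter']]

# Each stage runs one task per partition of the data.
num_partitions = 8
print(len(stages) * num_partitions)  # 24 tasks in total
```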
- Transformations vs. Actions:
  - Transformations: Operations that create a new RDD/DataFrame/Dataset from an existing one (e.g., filter, map, join). They are lazy, meaning they don't execute until an action is called.
  - Actions: Operations that trigger execution of the DAG and return a result to the driver or write data to external storage (e.g., count, show, write).
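Lazy evaluation can be demonstrated with plain Python generators (an analogy, not Spark code): chaining generator expressions builds a pipeline without running it, and nothing executes until a terminal operation, playing the role of an action, consumes the result.

```python
# Pure-Python analogy (not Spark code): generators mimic lazy
# transformations; nothing runs until an "action" consumes the pipeline.
log = []

def trace(value, label):
    log.append(label)   # record when work actually happens
    return value

data = range(5)
mapped = (trace(x * 2, "map") for x in data)               # transformation: lazy
filtered = (x for x in mapped if trace(x > 4, "filter"))   # transformation: lazy

assert log == []         # no work has been done yet

result = list(filtered)  # "action": triggers the whole pipeline at once
assert result == [6, 8]
assert "map" in log and "filter" in log
```

This is also why a bug in a transformation often surfaces only at the action that triggers it, far from where the transformation was defined.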
Shuffle Operations:
A shuffle is an expensive operation that redistributes data across partitions, often writing data to disk and transferring it across the network. Operations like groupByKey, reduceByKey, join, and repartition typically trigger a shuffle.
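The core of a shuffle can be sketched in pure Python (a simplification of Spark's shuffle machinery, which also spills to disk and moves blocks over the network): every input partition routes each record, by hashing its key, to the output partition that will own that key, so all records with the same key end up together.

```python
# Pure-Python sketch (not Spark internals): hash-partitioning records by key
# so that each key lands in exactly one output partition.
from collections import defaultdict

def shuffle(input_partitions, num_output_partitions):
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    for partition in input_partitions:       # every input partition...
        for key, value in partition:         # ...routes each record to
            target = hash(key) % num_output_partitions   # its target partition
            outputs[target][key].append(value)
    return outputs

# Word-count style input spread over two partitions.
parts = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
shuffled = shuffle(parts, 2)

# All ("a", ...) records now sit in exactly one output partition,
# ready for a per-key reduction such as reduceByKey.
homes_of_a = [i for i, p in enumerate(shuffled) if "a" in p]
assert len(homes_of_a) == 1
assert sorted(shuffled[homes_of_a[0]]["a"]) == [1, 1]
```

Because every input partition may send data to every output partition, the communication pattern is all-to-all, which is what makes shuffles expensive.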
Simplified Spark Execution Diagram (Text-based):
User Code (Transformations & Actions)
|
V
Driver Program (SparkContext)
|
V
Logical Plan (Unoptimized)
| (Catalyst Optimizer)
V
Optimized Logical Plan
|
V
Physical Plan (DAG of RDDs)
| (DAG Scheduler)
V
Stages (separated by shuffles)
| (Task Scheduler)
V
Tasks (sent to Executors)
|
V
Executors (process data partitions)