PySpark

1. What is Spark?

Spark is a distributed computing engine: it distributes data across machines in order to process it. Instead of processing a large dataset on a single machine, Spark uses multiple machines connected as a cluster, and it scales out by adding more machines to that cluster. This distributed architecture lets Spark process data volumes that exceed the memory and CPU limits of any single machine.

2. Spark Architecture (The Master-Slave Model)

Spark's architecture, a frequent interview topic, follows a master-slave model.

Cluster Manager: Manages resources and the overall cluster of computers.

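Driver Node (Master): Runs the user's code. In cluster deploy mode the driver is launched by the Cluster Manager; in client mode it runs on the machine that submitted the application. It analyzes the code and breaks it down into jobs, stages, and tasks. It then requests resources from the Cluster Manager and schedules the tasks on the executors it is given.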

Worker Nodes (Slaves): Allocated by the Cluster Manager at the Driver Program's request. These nodes run executors, which actually execute the tasks (and the transformations inside them) handed out by the Driver Program.

In essence, the "driver program will orchestrate the tasks," breaking down the code, and the "worker nodes will actually process our data."

Spark Has 3 Main Components:

Component	Simple Explanation
Driver Program	The brain – decides what needs to be done.
Cluster Manager	The coordinator – assigns machines and resources.
Worker Nodes	The workers – actually process the data.

🔄 How Spark Works (Big Picture):

You write a program using PySpark/Scala/Java.
Spark creates a Driver Program to run your code.
The Driver:
- Talks to the Cluster Manager (like YARN, Kubernetes, or Spark's own manager).
- Asks for Worker Nodes to do the job.
Worker nodes run executors, which:
- Read data
- Run transformations
- Write results

Intermediate Level: Jobs, Stages, and Tasks
🧱 When you run a Spark job:

🧩 Spark Workflow:

Your Code → Driver → Job → Stages → Tasks → Workers → Results

Now let's go deeper: first into how a job is broken down into stages and tasks, then into the advanced topics of RDDs, the DAG, executors, and memory management.

Each action in your code (e.g., the write at the end of read + filter + group + write) makes the Driver create a Job.
Each Job is broken into Stages.
- A stage is a group of operations that can run together without moving data between machines; a shuffle (e.g., for a group-by) starts a new stage.
Each Stage is broken into Tasks.
- Each Task processes one chunk of data (one partition).

👉 These Tasks are sent to the executors on the Worker nodes, which run them in parallel.
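
The "one task per partition, run in parallel" idea can be mimicked in plain Python. This is only a toy model, not Spark's scheduler: the function names (`split_into_partitions`, `count_even`, `run_stage`) are hypothetical, and a thread pool stands in for the executors.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per partition."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def count_even(partition):
    """One task: process a single partition and return a partial result."""
    return sum(1 for x in partition if x % 2 == 0)

def run_stage(task_fn, partitions, max_workers=4):
    """Run one task per partition in parallel (threads play the executors)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task_fn, partitions))

data = list(range(100))
partitions = split_into_partitions(data, num_partitions=4)
partial_counts = run_stage(count_even, partitions)  # 4 tasks, one per partition
total_even = sum(partial_counts)                    # the "driver" combines the results
```

In real Spark the number of tasks in a stage likewise follows the number of partitions, which is why partition count is a common tuning knob.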

🔹 RDDs (Resilient Distributed Datasets)

The RDD is Spark's core data structure.
RDD = A large, immutable dataset split into partitions across machines.
Spark tracks how to recreate an RDD if a node fails (this record is called lineage).
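
Lineage is easier to picture with a small model. The sketch below is not Spark's RDD class; `ToyRDD` and its methods are hypothetical names, and it assumes the source data can always be re-read. Each derived dataset remembers its parent and the transformation applied, so a lost result can be rebuilt by replaying those steps.

```python
class ToyRDD:
    """A toy stand-in for an RDD that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = source  # only set for the root dataset
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # the transformation applied to the parent
        self.name = name

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)], name="filter")

    def lineage(self):
        """Return the chain of steps from the source to this dataset."""
        steps = [] if self.parent is None else self.parent.lineage()
        return steps + [self.name]

    def compute(self):
        """Rebuild the data by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

numbers = ToyRDD(source=[1, 2, 3, 4, 5])
result = numbers.map(lambda x: x * 10).filter(lambda x: x > 20)

first_run = result.compute()
recovered = result.compute()  # simulate recomputing after the result was lost
```

Because nothing is stored except the recipe, "recovery" here is just running `compute()` again; Spark applies the same idea per partition.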

🔹 DAG (Directed Acyclic Graph)

Spark creates a DAG for your job.
It's a graph of all the operations and the order in which they depend on each other.
It lets Spark optimize execution before anything starts running.
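
As a rough illustration of what "a graph of operations to run in order" means, the sketch below uses Python's standard `graphlib` module (Python 3.9+) to order a small dependency graph. The step names are hypothetical, and this is not how Spark builds or schedules its DAG internally; it only shows that a valid execution order falls out of the dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the steps it depends on (hypothetical step names).
dag = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "write": {"join"},
}

# A topological order runs every step only after its dependencies.
execution_order = list(TopologicalSorter(dag).static_order())
```

The two read steps do not depend on each other, so either may come first; only the dependency constraints are guaranteed.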

🔹 Executors

Each Worker node runs one or more Executors.
Executors:
- Run the Tasks assigned by the Driver.
- Keep data in memory (for fast processing).
- Report progress back to the Driver.

🔹 Memory Management

Spark keeps data in RAM as much as possible.
It divides executor memory into:
- Storage memory (for cached data)
- Execution memory (for computations such as shuffles, joins, and sorts)
If memory is full, data spills to disk (slower).
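
A back-of-the-envelope calculation shows how the storage/execution split might work out for one executor. It assumes Spark's unified memory model with its documented defaults (`spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5) and a 300 MB reserved region; the function name and the 4 GB heap are hypothetical, and the defaults should be checked against the Spark version in use.

```python
RESERVED_MB = 300  # memory Spark sets aside for itself (assumed fixed value)

def unified_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the storage/execution split for one executor heap, in MB."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    execution = unified - storage
    return {"unified": unified, "storage": storage, "execution": execution}

# Hypothetical executor with a 4 GB heap.
split = unified_memory_split(heap_mb=4096)
```

The boundary between the two regions is soft: under the unified model, execution and storage can borrow unused space from each other, so these numbers are a starting point rather than hard limits.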

3. Difference between Hadoop and Spark

The most significant difference lies in how data is processed. Spark uses in-memory computation, meaning it keeps data in RAM during processing instead of constantly reading from and writing to disk like Hadoop MapReduce. This leads to much faster performance, especially for iterative algorithms or interactive analytics.
Another major benefit is lazy evaluation: in Spark, transformations (like map, filter, etc.) are not executed immediately. Instead, Spark builds a logical execution plan and only runs the computation when an action (like collect, count, or a write) is triggered. This lets Spark optimize the plan as a whole, for example by combining steps or reordering operations to reduce the number of data shuffles and improve speed.
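
Lazy evaluation can be sketched in a few lines of plain Python. `LazyDataset` is a hypothetical toy class, not a Spark API: transformations only record a step in a plan, and the action `collect` is what finally runs the plan, here in a single pass over the rows.

```python
class LazyDataset:
    """Toy model: transformations are recorded, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded (kind, function) steps

    def map(self, f):
        return LazyDataset(self._rows, self._plan + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._rows, self._plan + (("filter", pred),))

    def collect(self):
        """Action: execute the recorded plan in one pass over the rows."""
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):  # a filter step rejected this row
                    keep = False
                    break
            if keep:
                out.append(row)
        return out

calls = []  # records every time the mapping function actually runs

def double(x):
    calls.append(x)
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # the plan exists, but nothing has run yet
result = ds.collect()             # the action triggers the work
```

In PySpark the same pattern shows up when a call such as `df.filter(...)` returns immediately and the work only starts once an action such as `df.count()` is called.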
Spark also provides fault tolerance through a mechanism called lineage. If a node fails, Spark can recompute the lost partitions from the transformation history, instead of relying on replicated copies of the data across nodes as Hadoop does.
Additionally, Spark's data partitioning model is a foundational concept: it splits large datasets into smaller chunks (partitions), distributes them across the cluster, and processes them in parallel.
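
To make partitioning concrete, here is a minimal sketch of hash partitioning in plain Python. It assumes integer keys and uses hypothetical names (`hash_partition`, `sum_by_key`, and the sample customer records); Spark's own partitioners are more involved, but the principle is the same: records with the same key land in the same partition, so each partition can be aggregated independently.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)

def sum_by_key(partition):
    """Aggregate one partition on its own, with no data from other partitions."""
    totals = defaultdict(float)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

# Hypothetical (customer_id, amount) records.
records = [(1, 10.0), (2, 5.0), (3, 7.5), (1, 2.5), (4, 1.0), (2, 3.0)]
partitions = hash_partition(records, num_partitions=2)
totals_per_partition = {pid: sum_by_key(rows) for pid, rows in partitions.items()}
```

Because every occurrence of a key sits in one partition, the per-partition totals are already final and no further merging across partitions is needed.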

In summary, Spark typically outperforms Hadoop MapReduce thanks to in-memory processing, lazy evaluation with plan optimization, lineage-based fault tolerance, and efficient parallel processing, which makes it a common choice for modern big data applications.

4. Job, Stages, and Tasks (Hierarchical Structure)

Spark executes code in a hierarchical manner:

Job: The work triggered by a single action. In a notebook, the code submitted in one cell can trigger one or more jobs.

Stages: A job is divided into multiple stages, separated by shuffles.

Tasks: Each stage can have multiple tasks, one per data partition.

5. Spark Language Options & PySpark

Spark provides APIs (Application Programming Interfaces) to write code in various languages:

Python (PySpark)
Scala (Spark's native language, often preferred by those familiar with Java)
Java
SQL
R

Shivaram Babar
