What is it?
What it actually does
Real-time example (Structured Streaming with Kafka)
Quick mental picture
Connecting to the cluster manager (Standalone, YARN, Mesos, Kubernetes)
A cluster manager is the platform (cluster mode) on which Spark runs. Simply put, the cluster manager provides resources to the worker nodes as needed and coordinates them.
A cluster has a master node and worker nodes; the master provides the working environment that the worker nodes run in.
Spark supports several cluster managers:
Standalone cluster manager
Hadoop YARN
Kubernetes
Apache Mesos
Apache Spark also supports pluggable cluster management. The main task of the cluster manager is to provide resources to all applications; you can think of it as an external service for acquiring the resources an application needs on the cluster.
What is it?
Think of the cluster manager as the traffic controller + receptionist for your Spark jobs. It doesn’t do the data work itself—it finds machines, gives Spark the CPUs/RAM it needs, starts/stops worker processes, and keeps them healthy.
What it actually does
Finds resources: “You need 50 CPU cores and 100 GB RAM? I’ll reserve those on these machines.”
Starts executors: Launches the Spark executors (the workers that run your tasks).
Schedules & monitors: Keeps track of which tasks are running where, restarts them if a machine dies.
Scales up/down: Can add or remove executors (dynamic allocation) based on load; see the config sketch after this list.
Shares the cluster: Enforces queues/quotas so multiple teams can run jobs without stepping on each other.
(Under the hood, Spark can use different cluster managers: Standalone, YARN, Kubernetes, or Mesos. Same idea, different environments.)
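As a concrete example of the scaling point above, dynamic allocation is just configuration. A minimal PySpark sketch; the executor counts are placeholders, and shuffle tracking matters only where there is no external shuffle service:
from pyspark.sql import SparkSession

# Minimal sketch: let the cluster manager grow/shrink the executor pool with load.
# The min/max values below are placeholders; tune them for your workload.
spark = (SparkSession.builder
         .appName("ElasticJob")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         # On managers without an external shuffle service (e.g., Kubernetes),
         # shuffle tracking lets executors be released safely:
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())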
Real-time example (Structured Streaming with Kafka)
Imagine a fraud-detection pipeline (a rough code sketch follows these steps):
Driver program starts your streaming query (read from Kafka, window and aggregate, write alerts to a database).
The cluster manager (say, Kubernetes) receives the request and allocates pods/nodes with enough CPU/RAM.
It launches executors across those nodes; Spark sends mini-batches/micro-tasks to them.
If traffic spikes (more transactions), the cluster manager adds more executors so you keep low latency.
If a node crashes, it restarts the lost executors elsewhere and Spark retries the failed tasks—your stream keeps running.
At night when traffic drops, it scales down to save resources.
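A stripped-down Structured Streaming query for the pipeline above might look like the sketch below. The broker address, topic name, JSON schema, and the 10,000 threshold are all made-up placeholders, and the Kafka connector package must be on the classpath (e.g., via --packages).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("FraudAlerts").getOrCreate()

# Read a stream of transactions from Kafka (broker/topic names are placeholders).
txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "transactions")
        .load())

# Assume the Kafka message value is JSON with these fields (an assumption for this sketch).
schema = "card_id STRING, amount DOUBLE, event_time TIMESTAMP"
parsed = (txns
          .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
          .select("t.*"))

# Flag cards that spend suspiciously much within a 1-minute window.
alerts = (parsed
          .withWatermark("event_time", "5 minutes")
          .groupBy(F.window("event_time", "1 minute"), "card_id")
          .agg(F.sum("amount").alias("total"))
          .where(F.col("total") > 10000))

# Write alerts out; the cluster manager keeps executors alive for this long-running query.
query = (alerts.writeStream
         .outputMode("update")
         .format("console")  # stand-in for a real database sink
         .option("checkpointLocation", "/tmp/fraud-checkpoint")
         .start())
query.awaitTermination()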
Quick mental picture
Driver app → asks Cluster Manager → gets Executors → Executors process your data.
That’s it: the cluster manager’s job is to make sure the right amount of muscle is available and kept alive so your Spark job, batch or real-time, runs smoothly.
Connecting to the cluster manager (Standalone, YARN, Mesos, Kubernetes)
Spark needs a cluster manager to acquire resources (executors/cores/memory). You choose it via --master (or in code with .master(...) for local testing).
How to do it
Standalone: Spark’s built-in manager
YARN: Hadoop clusters
Kubernetes: containerized clusters
Mesos: legacy, rarely used today
Commands (spark-submit)
# Standalone cluster (replace host:port with your master URL)
spark-submit \
--master spark://spark-master:7077 \
--deploy-mode client \
--conf spark.executor.instances=4 \
app.py
# YARN (cluster mode)
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=4g \
app.py
# Kubernetes (example)
spark-submit \
--master k8s://https://my-k8s-apiserver:6443 \
--deploy-mode cluster \
--name my-spark-app \
--conf spark.kubernetes.container.image=myrepo/spark-py:latest \
--conf spark.executor.instances=4 \
--conf spark.kubernetes.namespace=data \
local:///opt/spark/app.py
# Mesos (legacy example)
spark-submit \
--master mesos://zk://mesos-zk:2181/mesos \
--deploy-mode client \
app.py
Local/dev (code)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ConnectDemo")
         .master("local[4]")   # local dev only
         .getOrCreate())
EXPLANATION
Why “connecting to a cluster manager” matters
Spark needs a cluster manager to allocate resources (executors/cores/memory). You pick the manager with --master when you submit your job (or .master("local[...]") for laptop dev). The driver (your app) talks to the cluster manager, which launches executors that actually run your tasks.
Standalone (Spark’s built-in manager)
Command (annotated)
spark-submit \
  --master spark://spark-master:7077 \    # URL of the Standalone "Master" node
  --deploy-mode client \                  # driver runs on your submitting machine
  --conf spark.executor.instances=4 \     # request 4 executor JVMs
  app.py                                  # your application file
What each piece does
--master spark://spark-master:7077: tells Spark where the Standalone Master lives.
--deploy-mode client: the driver runs where you launch spark-submit (logs print in your terminal).
(Tip: use cluster if you want the driver to run on a worker node.)
spark.executor.instances=4: asks the manager for 4 executors (parallel workers).
app.py: your PySpark program; the driver process executes it.
When to use: non-Hadoop clusters where you installed Spark’s Standalone master/worker daemons.
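For reference, those daemons are started with the scripts shipped in Spark’s sbin/ directory; a rough sketch, where the hostname is a placeholder (older Spark versions name the worker script start-slave.sh):
# On the machine that will act as the Standalone master:
$SPARK_HOME/sbin/start-master.sh                      # master RPC defaults to 7077, web UI to 8080

# On each worker machine, point the worker at the master URL:
$SPARK_HOME/sbin/start-worker.sh spark://spark-master:7077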
YARN (Hadoop clusters)
Command (annotated)
spark-submit \
--master yarn \ # use Hadoop YARN as cluster manager
--deploy-mode cluster \ # driver runs inside YARN (on a node in the cluster)
--conf spark.executor.instances=4 \ # 4 executors
--conf spark.executor.memory=4g \ # memory per executor
app.py
What each piece does
--master yarn: integrate with Hadoop’s YARN RM.
--deploy-mode cluster: the driver runs in the cluster; logs go to YARN (check the YARN UI).
spark.executor.instances, spark.executor.memory: resource sizing knobs.
Prereqs: HADOOP_CONF_DIR/YARN_CONF_DIR visible to Spark so it can talk to YARN; HDFS access if reading from HDFS.
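A typical submission, assuming your Hadoop client configs live under /etc/hadoop/conf (that path is an assumption; use whatever your cluster provides):
# Point Spark at the Hadoop/YARN client configuration before submitting.
export HADOOP_CONF_DIR=/etc/hadoop/conf    # assumed path; use your cluster's config dir
export YARN_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  app.py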
Kubernetes (K8s)
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \                    # K8s API endpoint
  --deploy-mode cluster \                                           # driver runs in a K8s pod
  --name my-spark-app \                                             # K8s app (driver) name
  --conf spark.kubernetes.container.image=myrepo/spark-py:latest \  # driver+executor image
  --conf spark.executor.instances=4 \                               # 4 executor pods
  --conf spark.kubernetes.namespace=data \                          # submit into this K8s namespace
  local:///opt/spark/app.py                                         # path inside the container image
What each piece does
--master k8s://...: point Spark to your K8s API server.
--deploy-mode cluster: the driver is a pod; executors are pods.
spark.kubernetes.container.image: container image that already contains Spark and your code (or use --py-files to ship code).
local:///opt/spark/app.py: local:// means “the file is already present in the driver container at this path.”
(Use file:// for host paths, not recommended here; use hdfs:// or s3a:// for remote storage.)
Prereqs: image in a registry accessible by the cluster, correct service account/RBAC, and a valid kubeconfig (or in-cluster submission).
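One common way to satisfy the service-account prereq looks roughly like this; the namespace (data) and account name (spark) are placeholders, and your cluster’s RBAC policy may be stricter:
# Create a service account for Spark and let it manage pods in the namespace.
kubectl create serviceaccount spark -n data
kubectl create rolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=data:spark \
  -n data

# Then tell Spark which service account the driver pod should use:
#   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark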
Mesos (legacy, rarely used today)
Command (annotated)
spark-submit \
--master mesos://zk://mesos-zk:2181/mesos \ # Mesos master from ZooKeeper
--deploy-mode client \ # driver on submit host
app.py
What each piece does
--master mesos://...: connects to Mesos; ZooKeeper provides HA master discovery.
Client vs. cluster mode has the same semantics as with the other managers.
Local/dev code (no cluster, just your machine)
Code
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ConnectDemo")
         .master("local[4]")   # run locally with 4 worker threads
         .getOrCreate())
What it means
.master("local[4]"): run Spark locally with 4 threads (great for unit tests and debugging).
In production, don’t hardcode .master(...); let spark-submit --master ... decide.
(If both are set, the master hardcoded in your code takes precedence over the spark-submit flag, which is exactly why you should leave it out of production code.)
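One way to follow that advice is to gate the hardcoded master behind an environment variable; SPARK_LOCAL_MASTER below is a made-up name for this sketch, not a Spark setting:
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("ConnectDemo")

# SPARK_LOCAL_MASTER is a convention invented for this sketch:
# set it to e.g. "local[4]" for unit tests, leave it unset in production
# so that spark-submit --master ... stays in control.
local_master = os.environ.get("SPARK_LOCAL_MASTER")
if local_master:
    builder = builder.master(local_master)

spark = builder.getOrCreate()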
Minimal end-to-end example (app.py)
Use this same file with ANY of the spark-submit commands above (Standalone/YARN/K8s). It prints where it connected, how many executors it sees, and does some parallel work so you can see tasks run.
app.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ClusterConnectDemo")
         .getOrCreate())
sc = spark.sparkContext

# --- Introspection: where am I running? ---
print("Master URL          :", sc.master)   # e.g., spark://..., yarn, k8s...
try:
    app_id = sc._jsc.sc().applicationId()
except Exception:
    app_id = "n/a"
print("Application ID      :", app_id)
print("Default parallelism :", sc.defaultParallelism)

# How many executors are alive (driver not guaranteed included)
try:
    ems = sc._jsc.sc().getExecutorMemoryStatus()
    print("Executor slots (map size):", ems.size())
except Exception:
    pass

# --- Actual work across partitions/tasks ---
rdd = sc.parallelize(range(1_000_000), numSlices=sc.defaultParallelism)
total = (rdd
         .map(lambda x: x * x)
         .reduce(lambda a, b: a + b))
print("Sum of squares  :", total)

# Check how many partitions were used
print("Partitions used :", rdd.getNumPartitions())

spark.stop()
How to run it
Standalone
spark-submit \
--master spark://spark-master:7077 \
--deploy-mode client \
--conf spark.executor.instances=4 \
app.py
YARN
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=4g \
app.py
Kubernetes (bake app.py into the image at /opt/spark/app.py)
spark-submit \
--master k8s://https://my-k8s-apiserver:6443 \
--deploy-mode cluster \
--name my-spark-app \
--conf spark.kubernetes.container.image=myrepo/spark-py:latest \
--conf spark.executor.instances=4 \
--conf spark.kubernetes.namespace=data \
local:///opt/spark/app.py