
How to coordinate distributed operations in Spark Core using the SparkContext object

SparkContext Tasks
Creating a SparkContext
Difference between SparkContext and SparkSession

In Spark’s architecture, the SparkContext object connects your driver program to a Spark cluster, manages jobs, and hands out tasks to worker nodes.

It’s the primary object in Spark Core that allows your application to talk to the Spark cluster.

SparkContext Tasks

Connecting to the cluster manager (Standalone, YARN, Mesos, Kubernetes)

Coordinating executors and task scheduling

Creating RDDs from data sources or collections

Managing configuration and job execution

Providing access to services like the Hadoop FileSystem, broadcast variables, and accumulators.

It’s the main object that creates and coordinates distributed operations in Spark Core.

Handles: cluster communication, job scheduling, resource management, and much more.

Every Spark application has (or starts with) a SparkContext.


Creating a SparkContext

Note
In modern Spark (2.x+), we usually create it via SparkSession, but you can still work with it directly.

from pyspark import SparkConf, SparkContext

# Create Spark configuration
conf = SparkConf() \
    .setAppName("SparkContextExample") \
    .setMaster("local[2]")  # local mode with 2 threads

# Initialize SparkContext
sc = SparkContext(conf=conf)

# --- Using SparkContext ---

# Create RDD from a Python list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Transformation + Action
squares = rdd.map(lambda x: x * x)
print("Squares:", squares.collect())

# Read a text file into RDD
text_rdd = sc.textFile("file:///tmp/sample.txt")
print("Number of lines:", text_rdd.count())

# Broadcast variable
broadcast_val = sc.broadcast([10, 20, 30])
print("Broadcast value:", broadcast_val.value)

# Accumulator
accum = sc.accumulator(0)
rdd.foreach(lambda x: accum.add(x))
print("Accumulator sum:", accum.value)

# Stop SparkContext
sc.stop()

Explanation

Set up Spark in local mode

conf = SparkConf().setAppName("SparkContextExample").setMaster("local[2]")
sc = SparkContext(conf=conf)

local[2] = run Spark on your single machine using 2 worker threads (like 2 tiny executors); great for learning and tests.

The driver creates a SparkContext (the entry point) and talks to the two worker threads.
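If you want to confirm how the context was configured, a few read-only attributes on sc are handy. A minimal sketch (reusing the sc created above; the printed values are only illustrative):

print(sc.master)              # 'local[2]'
print(sc.appName)             # 'SparkContextExample'
print(sc.defaultParallelism)  # 2 when running local[2]
print(sc.applicationId)       # e.g. 'local-1700000000000'
print(sc.uiWebUrl)            # Spark UI address, e.g. 'http://localhost:4040'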

Build an RDD from a Python list

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

parallelize slices the list into partitions so tasks can run in parallel (with local[2], two tasks at a time).
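To see the slicing for yourself, here is a minimal sketch (reusing rdd, sc, and data from above; the exact split is illustrative):

print(rdd.getNumPartitions())   # usually 2 under local[2]
print(rdd.glom().collect())     # elements grouped per partition, e.g. [[1, 2], [3, 4, 5]]

rdd4 = sc.parallelize(data, 4)  # you can also request an explicit number of partitions
print(rdd4.getNumPartitions())  # 4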

Transform then trigger an action

squares = rdd.map(lambda x: x * x)       # transformation (lazy)
print("Squares:", squares.collect())     # action (executes)

Transformations (like map) are lazy: Spark builds a plan (DAG) but does nothing yet.

collect() is an action: it triggers execution, runs tasks on the workers, and returns results to the driver.

Expected output: Squares: [1, 4, 9, 16, 25].
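A small sketch (building on rdd from above) that makes the lazy/eager split visible:

evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing executes yet
doubled = evens.map(lambda x: x * 2)       # still nothing executes
print(doubled.count())                     # action: tasks run now -> 2
print(doubled.collect())                   # another action re-runs the lineage -> [4, 8]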

Read a text file and count lines

text_rdd = sc.textFile("file:///tmp/sample.txt")
print("Number of lines:", text_rdd.count())

textFile creates an RDD where each element is a line.

count() is an action; it returns how many lines exist.

If you see Number of lines: 3, the file contains 3 newline-separated lines (the exact count depends on what /tmp/sample.txt holds on your machine).

Note: In cluster mode, file:// must point to a path visible to each worker; in local[...] you’re fine.
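If /tmp/sample.txt doesn't exist yet, you can create it first so the snippet runs. A minimal sketch with hypothetical content (three lines, so count() returns 3):

with open("/tmp/sample.txt", "w") as f:
    f.write("first line\nsecond line\nthird line\n")   # hypothetical sample content

text_rdd = sc.textFile("file:///tmp/sample.txt")
print("Number of lines:", text_rdd.count())            # 3 for the content above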

Broadcast variable (read-only, shipped once to workers)

broadcast_val = sc.broadcast([10, 20, 30])
print("Broadcast value:", broadcast_val.value)

What it is: a small, read-only object that Spark sends once to every executor/worker.

Why: avoids resending the same data with every task or doing an expensive join.

In your snippet you just print it; a real use would be:

bc = sc.broadcast(set(["a","an","the"]))
words = text_rdd.flatMap(lambda line: line.split())
clean = words.filter(lambda w: w not in bc.value)

Output here: Broadcast value: [10, 20, 30].
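Another common pattern is a broadcast "map-side join": ship a small lookup table to every worker and enrich a larger RDD without a shuffle. A minimal sketch with a hypothetical lookup table:

country_lookup = sc.broadcast({"US": "United States", "IN": "India", "DE": "Germany"})

orders = sc.parallelize([("US", 100), ("IN", 250), ("DE", 75)])
enriched = orders.map(lambda kv: (country_lookup.value.get(kv[0], "Unknown"), kv[1]))
print(enriched.collect())   # [('United States', 100), ('India', 250), ('Germany', 75)]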

Accumulator (driver-readable counter for metrics)

accum = sc.accumulator(0)
rdd.foreach(lambda x: accum.add(x))
print("Accumulator sum:", accum.value)

What it is: a write-only (for workers) / read-only (for driver) counter. Good for metrics (e.g., “bad rows seen”).

foreach runs on the workers; each element adds to the accumulator.

For your data [1,2,3,4,5], the sum becomes 15.

Output: Accumulator sum: 15.

Tips:

Accumulator values are only reliable after an action has run.

Because tasks can retry, accumulators are for monitoring, not core business logic.

In Scala/Java, the modern API prefers sc.longAccumulator("name"); in PySpark, sc.accumulator(0) is still the standard approach.
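Building on the "bad rows seen" idea above, here is a minimal sketch (hypothetical input data) that counts unparsable records while the job runs:

bad_rows = sc.accumulator(0)

def parse(line):
    try:
        return [int(line)]
    except ValueError:
        bad_rows.add(1)        # updated on the workers
        return []

raw = sc.parallelize(["1", "2", "oops", "4"])
parsed = raw.flatMap(parse)    # transformation: retries could double-count, so monitoring only
print(parsed.collect())        # action runs first: [1, 2, 4]
print(bad_rows.value)          # read on the driver after the action: 1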

Clean shutdown

sc.stop()

Releases resources and closes the Spark application.
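In scripts it is a good habit to guarantee the shutdown even when the job throws. A minimal sketch (reusing conf from the first example):

sc = SparkContext(conf=conf)
try:
    print(sc.parallelize(range(10)).sum())   # 45
finally:
    sc.stop()                                # always release executors and end the application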

Summary


Quick expected output (if /tmp/sample.txt has 3 lines)

Squares: [1, 4, 9, 16, 25]
Number of lines: 3
Broadcast value: [10, 20, 30]
Accumulator sum: 15

Difference between SparkContext and SparkSession

SparkContext is the original entry point to Apache Spark (pre–Spark 2.0).

SparkSession was introduced in Spark 2.0 to unify SparkContext, SQLContext, and HiveContext into a single entry point.


SparkSession
     |
     --> SparkContext (for RDD & low-level cluster interaction)
     --> SQLContext   (for DataFrame & SQL operations)

Old style (pre–2.0)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("OldWay").setMaster("local[2]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# RDD
rdd = sc.parallelize([1, 2, 3])
print(rdd.collect())

# DataFrame
df = sqlContext.read.json("people.json")
df.show()

Modern style (2.0+)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NewWay") \
    .master("local[2]") \
    .getOrCreate()

# DataFrame
df = spark.read.json("people.json")
df.show()

# RDD (via SparkContext inside SparkSession)
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.collect())

Using SparkContext from SparkSession
In all modern PySpark code, you use SparkSession, but you can always get the underlying SparkContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
sc = spark.sparkContext  # Access the SparkContext if needed

# Use sc for RDD work
data = [10, 20, 30]
rdd = sc.parallelize(data)
print(rdd.collect())

When to Use

Use SparkSession for all new Spark applications; it gives you everything (a short sketch follows at the end of this section):
✅ DataFrame API
✅ SQL queries
✅ RDD API via .sparkContext
✅ Catalog & Hive integration

Use SparkContext directly only when:

You are working in very old Spark code (pre–2.0)

You specifically want only the low-level RDD API without higher-level features
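A minimal sketch of the SparkSession checklist above (DataFrames, SQL, catalog, and RDDs from one object); the data and names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SessionFeatures").master("local[2]").getOrCreate()

# DataFrame API
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# SQL via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 2").show()

# Catalog access
print(spark.catalog.listTables())

# RDD API through the underlying SparkContext
print(spark.sparkContext.parallelize([1, 2, 3]).sum())

spark.stop()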

Quick Analogy

SparkContext → like the engine of a car (makes the cluster run, handles RDDs)

SparkSession → like the dashboard that controls the engine and gives extra features like GPS, media, AC, etc.
