SparkContext creates RDDs from in-memory collections or from external storage such as HDFS, S3, the local file system, and other Hadoop-supported formats.
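All of the snippets below assume an active SparkSession bound to the name spark, as in the pyspark shell. Outside the shell, a minimal local setup might look like this (the app name and local[*] master are illustrative choices):

from pyspark.sql import SparkSession

# Build a local SparkSession; app name and master are illustrative
spark = (SparkSession.builder
         .appName("rdd-examples")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext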
Examples
sc = spark.sparkContext
# From an in-memory collection
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
# From text files (HDFS/S3/local)
rdd_hdfs = sc.textFile("hdfs://namenode:8020/data/logs/*.log")
rdd_s3 = sc.textFile("s3a://my-bucket/2025/07/*.json")
rdd_local = sc.textFile("file:///var/tmp/sample.txt")
# Common transformations/actions
even_squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print("Even squares:", even_squares.collect())
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced = pairs.reduceByKey(lambda a, b: a + b)
print("Reduced:", reduced.collect())
- Create RDD from an In-Memory Collection
This example creates an RDD from an in-memory Python list.
sc = spark.sparkContext
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
# Transform: square the numbers
squares = nums.map(lambda x: x * x)
# Action: collect the results
print("Squares:", squares.collect())
Data Source: In-memory Python list
Transform: map() to square numbers
Action: collect() to bring the data back to the driver
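Note that transformations such as map() are lazy: calling them only records lineage, and nothing executes until an action runs. A quick way to observe this with the nums RDD above:

lazy_squares = nums.map(lambda x: x * x)  # returns immediately; no job runs yet
print(type(lazy_squares))                 # a PipelinedRDD, i.e. just a plan
print(lazy_squares.collect())             # the action triggers the computation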
- Create RDD from a Text File on HDFS
This creates an RDD from text files stored on HDFS; the glob matches every .log file in the directory.
rdd_hdfs = sc.textFile("hdfs://namenode:8020/data/logs/*.log")
# Action: count the number of lines
line_count = rdd_hdfs.count()
print(f"Line count across HDFS logs: {line_count}")
Data Source: HDFS
Transform: None (just reading)
Action: count() to count the number of lines
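textFile() also accepts a minPartitions hint that influences how many splits the input is read into; the value below is illustrative:

# Ask for at least 8 input partitions (a hint, not a hard guarantee)
rdd_hdfs = sc.textFile("hdfs://namenode:8020/data/logs/*.log", minPartitions=8)
print("Partitions:", rdd_hdfs.getNumPartitions())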
- Create RDD from a Text File on S3
This creates an RDD from files stored in S3, one element per line of text.
rdd_s3 = sc.textFile("s3a://my-bucket/2025/07/*.json")
# Transform: map JSON lines to their lengths
json_line_lengths = rdd_s3.map(lambda line: len(line))
# Action: collect the lengths
print("JSON line lengths:", json_line_lengths.collect())
Data Source: S3 (using s3a protocol)
Transform: map() to calculate line lengths
Action: collect() to fetch the data
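Reading s3a:// paths additionally requires the hadoop-aws connector on the classpath and credentials. One common way to supply credentials is through spark.hadoop.* properties when building the session; the key names below are standard Hadoop s3a properties, and the values are placeholders:

# Forward s3a credentials through Spark's Hadoop configuration (placeholders)
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .getOrCreate())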
- Create RDD from a Local File System
This example reads from a local file. Note that on a cluster, a file:// path must exist at the same location on every worker node.
rdd_local = sc.textFile("file:///var/tmp/sample.txt")
# Action: print the first 5 lines
print(rdd_local.take(5))
Data Source: Local file system
Transform: None (just reading)
Action: take() to get the first 5 lines
- Create RDD from a Python List of Pairs
You can create RDDs of key-value pairs, which enables pair-specific transformations such as reduceByKey().
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# Transform: sum the values of the same key using reduceByKey
reduced = pairs.reduceByKey(lambda a, b: a + b)
# Action: collect the results
print("Reduced:", reduced.collect())
Data Source: In-memory list of pairs
Transform: reduceByKey() to sum values by key
Action: collect() to gather the results
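The same pairs RDD supports other pair-specific operations; two common ones, shown as a brief sketch:

# mapValues() transforms only the values, leaving keys untouched
print(pairs.mapValues(lambda v: v * 10).collect())  # [('a', 10), ('b', 20), ('a', 30)]
# countByKey() is an action returning per-key counts to the driver
print(dict(pairs.countByKey()))                     # {'a': 2, 'b': 1}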
- Create RDD from a Large Dataset in HDFS
This reads large log files from HDFS and processes them.
rdd_logs = sc.textFile("hdfs://namenode:8020/data/logs/*.log")
# Transform: keep only lines containing "error" (case-sensitive match)
error_logs = rdd_logs.filter(lambda line: "error" in line)
# Action: count the error logs
print(f"Number of error logs: {error_logs.count()}")
Data Source: HDFS (log files)
Transform: filter() to get only error logs
Action: count() to count the error logs
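If you plan to run more than one action on the filtered RDD, caching it avoids re-reading and re-filtering the logs each time; a minimal sketch:

# cache() keeps the filtered partitions in memory after the first action
error_logs = rdd_logs.filter(lambda line: "error" in line).cache()
print("Errors:", error_logs.count())   # first action materializes the cache
print("Sample:", error_logs.take(3))   # served from the cached partitions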
- Create RDD from Multiple Text Files on Local FS
This creates an RDD from multiple text files on the local file system.
rdd_multiple_files = sc.textFile("file:///tmp/*.txt")
# Transform: map each line to its length
line_lengths = rdd_multiple_files.map(lambda line: len(line))
# Action: collect the lengths
print("Line lengths from multiple files:", line_lengths.collect())
Data Source: Local file system (multiple files)
Transform: map() to calculate line lengths
Action: collect() to get the results
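When you need to know which file each record came from, wholeTextFiles() returns one (path, content) pair per file instead of one element per line:

# One (filename, full-content) pair per matched file
files = sc.wholeTextFiles("file:///tmp/*.txt")
print("Matched files:", files.keys().collect())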
- Create RDD from an RDD Generated by Sampling
Sampling an existing RDD is useful when you want to process a smaller subset of data.
rdd_sample = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Transform: sample roughly 50% of the data without replacement
# (the fraction is an expected size, not an exact count)
sampled_rdd = rdd_sample.sample(False, 0.5)
# Action: collect the sample
print("Sampled RDD:", sampled_rdd.collect())
Data Source: In-memory collection
Transform: sample() to take a random sample of the data
Action: collect() to get the sampled data
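sample() also takes an optional seed, which makes the otherwise random sample reproducible across runs:

# Same withReplacement/fraction as above, plus a fixed seed (value illustrative)
sampled_repro = rdd_sample.sample(False, 0.5, seed=42)
print("Seeded sample:", sampled_repro.collect())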
- Create RDD from a CSV File on S3
This reads a CSV file stored on S3 and processes the data.
rdd_csv = sc.textFile("s3a://my-bucket/data.csv")
# Transform: split each line on commas (a more robust csv-based parse is sketched below)
rdd_split = rdd_csv.map(lambda line: line.split(','))
# Action: print the first 5 rows
print(rdd_split.take(5))
Data Source: S3 (CSV file)
Transform: map() to split each row by commas
Action: take() to print the first 5 rows
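A plain split(',') breaks on quoted fields that contain commas. For those cases, Python's csv module can parse each line; a sketch, with header handling omitted:

import csv
from io import StringIO

# Parse one CSV line, honoring quoted fields (e.g. "a, quoted field",2,3)
def parse_csv_line(line):
    return next(csv.reader(StringIO(line)))

rdd_parsed = rdd_csv.map(parse_csv_line)
print(rdd_parsed.take(5))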
- Create RDD from a Python Dictionary
You can create an RDD from an in-memory dictionary by converting it into a list of key-value pairs.
data_dict = {"apple": 3, "banana": 2, "cherry": 5}
# Create an RDD from the dictionary's (key, value) pairs
rdd_dict = sc.parallelize(list(data_dict.items()))
# Transform: filter keys that start with "a"
filtered_dict = rdd_dict.filter(lambda x: x[0].startswith('a'))
# Action: collect and print the result
print("Filtered dictionary:", filtered_dict.collect())
Data Source: In-memory dictionary
Transform: filter() to select keys that start with "a"
Action: collect() to fetch the results
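To turn a pair RDD back into a plain Python dict on the driver, collectAsMap() is the counterpart of the parallelize(items) step:

# collectAsMap() gathers a pair RDD into a dict on the driver
print(rdd_dict.collectAsMap())  # {'apple': 3, 'banana': 2, 'cherry': 5}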
Summary of RDD Creation and Transformations/Actions:
RDD from Collections: parallelize() creates RDDs from in-memory collections such as lists, tuples, or a dictionary's items.
RDD from Files: textFile() creates RDDs from text files located on HDFS, S3, or the local file system.
Common Transformations: map(), filter(), reduceByKey(), and sample() let you manipulate RDD data.
Actions: collect(), count(), and take() retrieve results from RDD transformations.
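Putting these pieces together, the classic word count chains several of the above; flatMap(), which splits each line into multiple output elements, is the one operation not shown earlier (the path is illustrative):

# Word count: read lines, split into words, pair each with 1, sum per word
counts = (sc.textFile("file:///var/tmp/sample.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(10))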
These examples demonstrate how you can create and manipulate RDDs using different data sources, and perform common transformations and actions in Apache Spark.