List of the different transformations and actions used in Spark RDDs

In Apache Spark, which follows the functional programming paradigm, operations like
map()
filter()
reduce()
reduceByKey()
flatMap()
are used to transform data in a functional way.
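All of the Spark examples below assume a running SparkContext named sc. A minimal sketch of a local setup (the app name rdd-examples is an arbitrary choice here), which also shows the distinction the title refers to: transformations such as map() are lazy and return a new RDD, while actions such as collect() and count() trigger computation and return a result.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-examples")  # assumed local setup; app name is arbitrary

rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)  # transformation: lazy, returns a new RDD
print(squared.collect())            # action: triggers computation -> [1, 4, 9, 16, 25]
print(squared.count())              # action: returns 5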

Lambda Functions (Anonymous Functions)

# Lambda function that adds 10 to a given number
add_ten = lambda x: x + 10
print(add_ten(5))  # Output: 15
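Lambdas can also take multiple arguments; the aggregation functions passed to reduce() and reduceByKey() later in this post are written this way:

# Lambda function with two arguments
add = lambda x, y: x + y
print(add(3, 4))  # Output: 7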

map() Function: Applies a given function to each element of an iterable (or RDD) and returns the transformed elements.

rdd.map(lambda x: x * x)

# Another way: pass a named function instead of a lambda
def square(x):
    return x * x

rdd.map(square)
# map() example in Python
numbers = [1, 2, 3, 4, 5]
# Using a lambda function to square each number
squared_numbers = map(lambda x: x ** 2, numbers)

print(list(squared_numbers))  # Output: [1, 4, 9, 16, 25]
In Spark, map() works similarly, applying the function to every element of the RDD:

rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x ** 2)
print(result.collect())  # Output: [1, 4, 9, 16, 25]

filter() Function: Filters elements based on a function that returns a boolean value. It’s used to remove unwanted data, retaining only elements that match a specific condition.

# filter() example in Python
numbers = [1, 2, 3, 4, 5, 6]
# Using a lambda function to filter even numbers
even_numbers = filter(lambda x: x % 2 == 0, numbers)

print(list(even_numbers))  # Output: [2, 4, 6]

In Spark, filter() works similarly to remove unwanted elements from an RDD:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())  # Output: [2, 4, 6]
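As with map(), a named function works just as well as a lambda here; a minimal sketch reusing the rdd defined above:

# Using a named predicate function instead of a lambda
def is_even(x):
    return x % 2 == 0

even_rdd = rdd.filter(is_even)
print(even_rdd.collect())  # Output: [2, 4, 6]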
reduce() Function: Performs a rolling computation over the items of an iterable and returns a single result. It's often used for aggregating data.

from functools import reduce

# reduce() example in Python: Sum of all elements
numbers = [1, 2, 3, 4, 5]
sum_result = reduce(lambda x, y: x + y, numbers)

print(sum_result)  # Output: 15
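Spark's RDD API has an equivalent reduce() action that aggregates all elements of an RDD into a single value:

rdd = sc.parallelize([1, 2, 3, 4, 5])
total = rdd.reduce(lambda x, y: x + y)  # action: returns a value, not an RDD
print(total)  # Output: 15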

In Spark, reduceByKey() is commonly used to aggregate values by key, where the transformation happens on a per-key basis:

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())  # Output: [('a', 4), ('b', 6)]
reduceByKey() in Spark: Aggregates values with the same key using a provided function.
# Example in Spark-like fashion
rdd = sc.parallelize([('apple', 1), ('banana', 2), ('apple', 3), ('banana', 4)])

# Reduce by key (summing values for each key)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())  # Output: [('banana', 6), ('apple', 4)]
flatMap() Function: Similar to map(), but it can return zero or more output items for each input item. It's useful when you need to flatten a structure (like a list of lists) into a single list.
# flatMap() example in Python using list comprehension
numbers = [1, 2, 3]
result = list(map(lambda x: [x, x * 2], numbers))  # Nested lists
flattened_result = [item for sublist in result for item in sublist]  # Flattening
print(flattened_result)  # Output: [1, 2, 2, 4, 3, 6]

In Spark, flatMap() allows for emitting a variable number of output elements for each input element:

rdd = sc.parallelize([1, 2, 3])
flat_map_rdd = rdd.flatMap(lambda x: [x, x * 2])
print(flat_map_rdd.collect())  # Output: [1, 2, 2, 4, 3, 6]
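A common real-world use of flatMap() is splitting lines of text into words, where each line can produce a different number of outputs:

lines = sc.parallelize(["hello world", "hello spark"])
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())  # Output: ['hello', 'world', 'hello', 'spark']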
Example Combining map(), filter(), and reduceByKey():

A typical Spark pipeline chains filter(), map(), and reduceByKey() to process data.

rdd = sc.parallelize([('apple', 1), ('banana', 2), ('apple', 3), ('banana', 4), ('orange', 1)])

# Step 1: Filter fruits with a name starting with 'a'
filtered_rdd = rdd.filter(lambda x: x[0].startswith('a'))

# Step 2: Map to change values (multiplying the fruit count by 10)
mapped_rdd = filtered_rdd.map(lambda x: (x[0], x[1] * 10))

# Step 3: Reduce by key (sum the counts)
reduced_rdd = mapped_rdd.reduceByKey(lambda x, y: x + y)

print(reduced_rdd.collect())  # Output: [('apple', 40)]
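The same chaining pattern gives the classic word-count pipeline, combining flatMap(), map(), and reduceByKey():

lines = sc.parallelize(["apple banana", "apple orange", "banana apple"])
words = lines.flatMap(lambda line: line.split(" "))   # split lines into words
pairs = words.map(lambda word: (word, 1))             # pair each word with a count of 1
counts = pairs.reduceByKey(lambda x, y: x + y)        # sum counts per word
print(counts.collect())  # Output (order may vary): [('apple', 3), ('banana', 2), ('orange', 1)]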

Key Points:

Lambda Functions: Short, anonymous functions used for quick transformations.

map(): Transforms each element in an iterable.

filter(): Filters elements based on a condition.

reduce() / reduceByKey(): Aggregates elements into a single result (or one result per key).

flatMap(): Maps each element to zero or more outputs and flattens the results into a single collection.
