Some example data:
new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "science", 'exam_score': 87},
            {'name': 'Eve', 'subject': "history", 'exam_score': 90}]
new_rdd = sc.parallelize(new_data)
A student passes an exam if they score 60 or more.
I would like to return a Spark RDD containing each student's name followed by the number of exams they passed (a number between 1 and 3).
I'm assuming I would have to use groupByKey() and map() here?
The expected output should look something like:
# [('Tom', 2),
# ('Ivy', 2),
# ('Ben', 1),
# ('Eve', 3)]
CodePudding user response:
You can use filter() for the pass condition, then map() to turn each record into a (name, 1) pair, and reduceByKey() to sum the 1s per name.
new_rdd. \
    filter(lambda r: r['exam_score'] >= 60). \
    map(lambda r: (r['name'], 1)). \
    reduceByKey(lambda x, y: x + y). \
    collect()
# [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]
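If you want to sanity-check the filter → map → reduceByKey logic without a running SparkContext, the same pipeline can be sketched in plain Python with collections.Counter (this emulates the Spark behaviour, it is not Spark itself):

```python
from collections import Counter

new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 87},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 90}]

# filter: keep passing scores; map: emit the name; Counter plays the
# role of reduceByKey by summing one count per occurrence of each name
passes = Counter(r['name'] for r in new_data if r['exam_score'] >= 60)

print(dict(passes))
# {'Tom': 2, 'Ivy': 2, 'Ben': 1, 'Eve': 3}
```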
CodePudding user response:
Done :)
# keep passing scores only, then emit a (name, 1) pair per pass
pass_rdd = new_rdd.filter(lambda my_dict: my_dict['exam_score'] >= 60) \
                  .map(lambda my_dict: (my_dict['name'], 1))
print(pass_rdd.collect())
# [('Tom', 1), ('Tom', 1), ('Ivy', 1), ('Ivy', 1), ('Ben', 1), ('Eve', 1), ('Eve', 1), ('Eve', 1)]
from operator import add

# sum the 1s for each name
count_rdd = pass_rdd.reduceByKey(add)
print(count_rdd.collect())
# [('Eve', 3), ('Tom', 2), ('Ivy', 2), ('Ben', 1)]
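On the groupByKey() idea from the question: in Spark you could also write new_rdd.filter(...).map(lambda r: (r['name'], 1)).groupByKey().mapValues(len), though reduceByKey is generally preferred because it combines counts on each partition before shuffling. A plain-Python emulation of the groupByKey variant (an illustration of the logic, not actual Spark code):

```python
from collections import defaultdict

new_data = [{'name': 'Tom', 'exam_score': 85},
            {'name': 'Tom', 'exam_score': 55},
            {'name': 'Tom', 'exam_score': 68},
            {'name': 'Ivy', 'exam_score': 72},
            {'name': 'Ivy', 'exam_score': 67},
            {'name': 'Ivy', 'exam_score': 59},
            {'name': 'Ben', 'exam_score': 56},
            {'name': 'Ben', 'exam_score': 51},
            {'name': 'Ben', 'exam_score': 63},
            {'name': 'Eve', 'exam_score': 74},
            {'name': 'Eve', 'exam_score': 87},
            {'name': 'Eve', 'exam_score': 90}]

# filter + map: build the (name, 1) pairs for passing scores
pairs = [(r['name'], 1) for r in new_data if r['exam_score'] >= 60]

# groupByKey: gather every value under its key
groups = defaultdict(list)
for name, one in pairs:
    groups[name].append(one)

# mapValues(len): the count of passes is the size of each group
counts = {name: len(ones) for name, ones in groups.items()}

print(counts)
# {'Tom': 2, 'Ivy': 2, 'Ben': 1, 'Eve': 3}
```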