Some example data:
new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "science", 'exam_score': 87},
            {'name': 'Eve', 'subject': "history", 'exam_score': 90}]
new_rdd = sc.parallelize(new_data)
A student passes an exam if they score 60 or more.
I would like to return a Spark RDD containing each student's name followed by the number of exams they passed (a number between 1 and 3).
I'm assuming I would have to use groupByKey() and map() here?
The expected output should look something like:
# [('Tom', 2),
# ('Ivy', 2),
# ('Ben', 1),
# ('Eve', 3)]
CodePudding user response:
You can use filter() for the pass condition, then map() to turn each record into a (name, 1) pair, and reduceByKey() to sum the 1s per name.
new_rdd. \
    filter(lambda r: r['exam_score'] >= 60). \
    map(lambda r: (r['name'], 1)). \
    reduceByKey(lambda x, y: x + y). \
    collect()
# [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]
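If you want to sanity-check the filter → map → reduceByKey logic without a running SparkContext, the same pipeline can be sketched in plain Python with collections.Counter (this emulates the Spark behaviour, it is not Spark itself):

```python
from collections import Counter

new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 87},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 90}]

# filter: keep passing scores; map: emit the name; Counter plays the
# role of reduceByKey by summing one count per occurrence of each name
passes = Counter(r['name'] for r in new_data if r['exam_score'] >= 60)

print(dict(passes))
# {'Tom': 2, 'Ivy': 2, 'Ben': 1, 'Eve': 3}
```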
CodePudding user response:
Done :)
# keep passing scores only, then emit a (name, 1) pair per pass
pass_rdd = new_rdd.filter(lambda my_dict: my_dict['exam_score'] >= 60) \
                  .map(lambda my_dict: (my_dict['name'], 1))
print(pass_rdd.collect())
# [('Tom', 1), ('Tom', 1), ('Ivy', 1), ('Ivy', 1), ('Ben', 1), ('Eve', 1), ('Eve', 1), ('Eve', 1)]
from operator import add

# sum the 1s for each name
count_rdd = pass_rdd.reduceByKey(add)
print(count_rdd.collect())
# [('Eve', 3), ('Tom', 2), ('Ivy', 2), ('Ben', 1)]
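On the groupByKey() idea from the question: in Spark you could also write new_rdd.filter(...).map(lambda r: (r['name'], 1)).groupByKey().mapValues(len), though reduceByKey is generally preferred because it combines counts on each partition before shuffling. A plain-Python emulation of the groupByKey variant (an illustration of the logic, not actual Spark code):

```python
from collections import defaultdict

new_data = [{'name': 'Tom', 'exam_score': 85},
            {'name': 'Tom', 'exam_score': 55},
            {'name': 'Tom', 'exam_score': 68},
            {'name': 'Ivy', 'exam_score': 72},
            {'name': 'Ivy', 'exam_score': 67},
            {'name': 'Ivy', 'exam_score': 59},
            {'name': 'Ben', 'exam_score': 56},
            {'name': 'Ben', 'exam_score': 51},
            {'name': 'Ben', 'exam_score': 63},
            {'name': 'Eve', 'exam_score': 74},
            {'name': 'Eve', 'exam_score': 87},
            {'name': 'Eve', 'exam_score': 90}]

# filter + map: build the (name, 1) pairs for passing scores
pairs = [(r['name'], 1) for r in new_data if r['exam_score'] >= 60]

# groupByKey: gather every value under its key
groups = defaultdict(list)
for name, one in pairs:
    groups[name].append(one)

# mapValues(len): the count of passes is the size of each group
counts = {name: len(ones) for name, ones in groups.items()}

print(counts)
# {'Tom': 2, 'Ivy': 2, 'Ben': 1, 'Eve': 3}
```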