Frequencies of number in a list - Pyspark


I have this code that outputs a list of values:

ARDD.map(function_B) \
            .filter(lambda x: x is not None) \
            .take(6)

Output:

['2','10','2','12','3','3']

How can I change the code to get this output?

[2:2, 3:2, 10:1, 12:1]

CodePudding user response:

Use the map and reduceByKey RDD methods:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(['2', '10', '2', '12', '3', '3'])

# Pair each value with a count of 1, sum the counts per key,
# then format each (value, count) pair as "value:count"
rdd1 = rdd.map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda x: f"{x[0]}:{x[1]}")

print(rdd1.collect())
# ['10:1', '12:1', '3:2', '2:2']
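
If a plain Python dictionary of counts is acceptable, countByValue is a shorter route and is closer to the key:value mapping shown in the question. A minimal sketch, assuming the same spark session and input values as above:

rdd = spark.sparkContext.parallelize(['2', '10', '2', '12', '3', '3'])

# countByValue() returns the counts to the driver as a defaultdict,
# e.g. {'2': 2, '10': 1, '12': 1, '3': 2}
counts = rdd.countByValue()
print(dict(counts))

Note that countByValue brings all counts back to the driver, so it only makes sense when the number of distinct values is small; the reduceByKey version keeps the aggregation distributed.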