I have an RDD where I have used countByValue() to count the frequency of job types within the data. I believe this outputs key-value pairs of the form (jobType, frequency).
freq_per_job=previous_val.map(lambda x:x[3]).countByValue()
Here the lambda maps each record to the job type, which is at index [3].
I then wish to total the job types up and output the top 10 job types, but I can't seem to do this. I have tried using sortByKey(False), but I keep getting the following error:
AttributeError: 'collections.defaultdict' object has no attribute 'sortByKey'
I am new to PySpark, so I am not sure how to go about fixing this.
CodePudding user response:
Hi, it is not working because countByValue() returns a Python dictionary rather than an RDD, while sortByKey() is an RDD method. There are two ways to get an RDD to sort:
- Convert the countByValue() dictionary back to an RDD (not advisable for large data, since the counts are already collected in the driver):
freq_per_job=previous_val.map(lambda x:x[3]).countByValue()
freq_per_job_rdd=sc.parallelize(list(freq_per_job.items()))
freq_per_job_rdd.sortByKey().collect()
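Since countByValue() already hands back a plain dictionary in the driver, for small results you can also skip the round-trip to an RDD and sort it directly in Python. A minimal driver-side sketch (the job names and counts here are made-up sample data standing in for your real `freq_per_job`):

```python
from collections import defaultdict

# countByValue() returns a defaultdict-like mapping in the driver, e.g.:
freq_per_job = defaultdict(int, {"engineer": 5, "teacher": 12, "nurse": 7})

# Sorting the items by key mirrors what sortByKey() would do on an RDD
sorted_by_key = sorted(freq_per_job.items())
# [('engineer', 5), ('nurse', 7), ('teacher', 12)]
```

This only makes sense when the number of distinct job types is small enough to fit comfortably in driver memory, which is usually the case for a frequency table.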
- Use map and reduceByKey, then sortByKey (this keeps everything distributed):
freq_per_job=previous_val.map(lambda x: x[3]).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).sortByKey().collect()
Also, if you just want the top 10 records by job count, then instead of sortByKey() you can use top() with a key. Note that top() already returns a Python list to the driver, so no collect() is needed:
freq_per_job=previous_val.map(lambda x: x[3]).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).top(10, key=lambda x: x[1])
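RDD.top(n, key=...) behaves much like Python's heapq.nlargest. As a rough driver-side analogue of what the call above does (the (job, count) pairs are invented sample data):

```python
import heapq

# Stand-in for the (jobType, count) pairs produced by reduceByKey
counts = [("engineer", 5), ("teacher", 12), ("nurse", 7), ("clerk", 3)]

# Equivalent of RDD.top(2, key=lambda x: x[1]):
# the 2 pairs with the highest counts, largest first
top2 = heapq.nlargest(2, counts, key=lambda x: x[1])
# [('teacher', 12), ('nurse', 7)]
```

The key function tells top() to rank by the count (position 1 of each pair) rather than by the job name, which is exactly what you want for a "top 10 job types" list.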