I have an RDD where I have used countByValue() to count the frequency of job types within the data. I believe this outputs key-value pairs of the form (jobType, frequency).
freq_per_job=previous_val.map(lambda x:x[3]).countByValue()
Here the lambda maps each record to the job type, which is at index [3].
I then wish to total the job types up and output the top 10 job types, but I can't seem to do this. I have tried using sortByKey(False), but I keep getting the following error:
AttributeError: 'collections.defaultdict' object has no attribute 'sortByKey'
I am new to PySpark, so I am not sure how to go about fixing this.
CodePudding user response:
Hi, it is not working because countByValue() returns a Python dictionary rather than an RDD, while sortByKey() is an RDD method. There are two ways to get an RDD to sort:
- Convert the countByValue() dictionary back to an RDD (not advisable for large data, since the counts are already collected in the driver):
freq_per_job=previous_val.map(lambda x:x[3]).countByValue()
freq_per_job_rdd=sc.parallelize(list(freq_per_job.items()))
freq_per_job_rdd.sortByKey().collect()
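Since countByValue() already hands back a plain dictionary in the driver, for small results you can also skip the round-trip to an RDD and sort it directly in Python. A minimal driver-side sketch (the job names and counts here are made-up sample data standing in for your real `freq_per_job`):

```python
from collections import defaultdict

# countByValue() returns a defaultdict-like mapping in the driver, e.g.:
freq_per_job = defaultdict(int, {"engineer": 5, "teacher": 12, "nurse": 7})

# Sorting the items by key mirrors what sortByKey() would do on an RDD
sorted_by_key = sorted(freq_per_job.items())
# [('engineer', 5), ('nurse', 7), ('teacher', 12)]
```

This only makes sense when the number of distinct job types is small enough to fit comfortably in driver memory, which is usually the case for a frequency table.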
- Use map and reduceByKey, then sortByKey (this keeps everything distributed):
freq_per_job=previous_val.map(lambda x: x[3]).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).sortByKey().collect()
Also, if you just want the top 10 records by job count, then instead of sortByKey() you can use top() with a key. Note that top() already returns a Python list to the driver, so no collect() is needed:
freq_per_job=previous_val.map(lambda x: x[3]).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).top(10, key=lambda x: x[1])
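RDD.top(n, key=...) behaves much like Python's heapq.nlargest. As a rough driver-side analogue of what the call above does (the (job, count) pairs are invented sample data):

```python
import heapq

# Stand-in for the (jobType, count) pairs produced by reduceByKey
counts = [("engineer", 5), ("teacher", 12), ("nurse", 7), ("clerk", 3)]

# Equivalent of RDD.top(2, key=lambda x: x[1]):
# the 2 pairs with the highest counts, largest first
top2 = heapq.nlargest(2, counts, key=lambda x: x[1])
# [('teacher', 12), ('nurse', 7)]
```

The key function tells top() to rank by the count (position 1 of each pair) rather than by the job name, which is exactly what you want for a "top 10 job types" list.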