PySpark: How to find the top n most frequently occurring values in an array column?

Time:11-09

For the sample data below, how can I find the most frequently occurring values in the column colour? The data type of colour is WrappedArray, and each array can hold any number of elements. In this example the most frequent colour should be yellow (three occurrences), followed by blue, which appears twice. Many thanks for your help.

Name   Colour
A      ('blue', 'yellow')
B      ('pink', 'yellow')
C      ('green', 'black')
D      ('yellow', 'orange', 'blue')

CodePudding user response:

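I would explode the colour column so that each array element becomes its own row, then run groupBy and count and sort by the count in descending order. To keep only the top n values, append .limit(n) to the result.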

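from pyspark.sql.functions import col, explode

# One row per array element, then count how often each colour occurs
df \
.select(explode('colour').alias('colour')) \
.groupBy('colour') \
.count() \
.orderBy(col('count').desc())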
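As a quick sanity check of what the explode, groupBy and count steps should return for the sample data, the same logic can be sketched in plain Python without a Spark session. This is not PySpark code: it swaps in collections.Counter for the Spark aggregation, and the names rows and top_n_values are made up for illustration.

```python
from collections import Counter
from itertools import chain

# Sample rows from the question: (Name, Colour array)
rows = [
    ("A", ["blue", "yellow"]),
    ("B", ["pink", "yellow"]),
    ("C", ["green", "black"]),
    ("D", ["yellow", "orange", "blue"]),
]

def top_n_values(rows, n):
    """Return the n most frequent array elements as (value, count) pairs."""
    # Flatten every array into one stream of values (the "explode" step)
    flattened = chain.from_iterable(colours for _, colours in rows)
    # Count each value and keep the n largest counts (groupBy + count + orderBy)
    return Counter(flattened).most_common(n)

print(top_n_values(rows, 2))  # [('yellow', 3), ('blue', 2)]
```

Note that the four colours which appear only once are tied. Counter returns ties in first-seen order, whereas Spark gives no guaranteed order among rows with equal counts unless a second sort column is added.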