I have a PySpark dataframe like the following:
columns = ["id","values"]
data = [("sample1", ["a","b","a"]), ("sample2", ["b","b","a","c"])]
dataframe = spark.createDataFrame(data, columns)
Source dataframe:
+-------+--------------------+
|     id|              values|
+-------+--------------------+
|sample1|       ["a","b","a"]|
|sample2|   ["b","b","a","c"]|
+-------+--------------------+
I would like to build a column with the most common value in the array and obtain a dataframe like the following:
+-------+--------------------+---------+
|     id|              values|   common|
+-------+--------------------+---------+
|sample1|       ["a","b","a"]|      "a"|
|sample2|   ["b","b","a","c"]|      "b"|
+-------+--------------------+---------+
CodePudding user response:
You can explode the array values, then group by to count the occurrences of each value, and use a Window to keep only the value with the max count:
from pyspark.sql import Window
import pyspark.sql.functions as F

# df is the dataframe shown in the question
df1 = df.withColumn(
    "common",
    F.explode("values")  # one row per element of the array
).groupBy("id", "values", "common").count().withColumn(
    "rn",
    F.row_number().over(Window.partitionBy("id", "values").orderBy(F.col("count").desc()))
).filter("rn = 1").drop("rn", "count")  # keep only the most frequent value per row
df1.show()
#+-------+------------+------+
#|id     |values      |common|
#+-------+------------+------+
#|sample1|[a, b, a]   |a     |
#|sample2|[b, b, a, c]|b     |
#+-------+------------+------+
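Note that grouping on both id and values keeps the original array column in the result, and that row_number breaks ties between equally frequent values arbitrarily; add a secondary orderBy (for example on the value itself) if you need deterministic output.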
Another way, without using explode, is to use the higher-order functions transform and filter along with some array functions:
# For each distinct element, build a struct (count, value); array_max then
# returns the struct with the highest count and ["value"] extracts the element.
df1 = df.withColumn(
    "common",
    F.array_max(
        F.expr("""transform(
            array_distinct(values),
            x -> struct(
                size(filter(values, y -> y = x)) as count,
                x as value
            )
        )""")
    )["value"]
)
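This works because Spark compares structs field by field, so array_max picks the struct with the highest count, and ["value"] then pulls out the corresponding element. If you are on Spark 3.4 or later, a shorter option is the built-in mode aggregate; below is a minimal sketch under that assumption, recomputing the same common column from the question's data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("sample1", ["a", "b", "a"]), ("sample2", ["b", "b", "a", "c"])],
    ["id", "values"],
)

# Assumes Spark >= 3.4, where F.mode is available as an aggregate function:
# explode the array, then take the most frequent element per (id, values) group.
df1 = (
    df.withColumn("v", F.explode("values"))
      .groupBy("id", "values")
      .agg(F.mode("v").alias("common"))
)
df1.show()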