I have a PySpark dataframe like the following:
columns = ["id","values"]
data = [("sample1", ["a","b","a"]), ("sample2", ["b","b","a","c"])]
dataframe = spark.createDataFrame(data, columns)
Source dataframe:
+-------+--------------------+
|     id|              values|
+-------+--------------------+
|sample1|       ["a","b","a"]|
|sample2|   ["b","b","a","c"]|
+-------+--------------------+
I would like to build a column with the most common value in the array and obtain a dataframe like the following:
+-------+--------------------+---------+
|     id|              values|   common|
+-------+--------------------+---------+
|sample1|       ["a","b","a"]|      "a"|
|sample2|   ["b","b","a","c"]|      "b"|
+-------+--------------------+---------+
CodePudding user response:
You can explode the array values, then group by to count the occurrences of each value, and use a Window to keep only the value with the max count:
from pyspark.sql import Window
import pyspark.sql.functions as F

# df is the dataframe shown in the question
df1 = df.withColumn(
    "common",
    F.explode("values")  # one row per element of the array
).groupBy("id", "values", "common").count().withColumn(
    "rn",
    F.row_number().over(Window.partitionBy("id", "values").orderBy(F.col("count").desc()))
).filter("rn = 1").drop("rn", "count")  # keep only the most frequent value per row
df1.show()
#+-------+------------+------+
#|id     |values      |common|
#+-------+------------+------+
#|sample1|[a, b, a]   |a     |
#|sample2|[b, b, a, c]|b     |
#+-------+------------+------+
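Note that grouping on both id and values keeps the original array column in the result, and that row_number breaks ties between equally frequent values arbitrarily; add a secondary orderBy (for example on the value itself) if you need deterministic output.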
Another way, without using explode, is to use the higher-order functions transform and filter along with some array functions:
# For each distinct element, build a struct (count, value); array_max then
# returns the struct with the highest count and ["value"] extracts the element.
df1 = df.withColumn(
    "common",
    F.array_max(
        F.expr("""transform(
            array_distinct(values),
            x -> struct(
                size(filter(values, y -> y = x)) as count,
                x as value
            )
        )""")
    )["value"]
)
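This works because Spark compares structs field by field, so array_max picks the struct with the highest count, and ["value"] then pulls out the corresponding element. If you are on Spark 3.4 or later, a shorter option is the built-in mode aggregate; below is a minimal sketch under that assumption, recomputing the same common column from the question's data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("sample1", ["a", "b", "a"]), ("sample2", ["b", "b", "a", "c"])],
    ["id", "values"],
)

# Assumes Spark >= 3.4, where F.mode is available as an aggregate function:
# explode the array, then take the most frequent element per (id, values) group.
df1 = (
    df.withColumn("v", F.explode("values"))
      .groupBy("id", "values")
      .agg(F.mode("v").alias("common"))
)
df1.show()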