Scala spark multiple values in a column average them


I have an org.apache.spark.sql.DataFrame = [_1: string, _2: string] with a column containing values in the format below.

Input

_1   _2
1    1,2,3,4
11   6,7,11,13
15   22
20   5,10,21,22,28,32

I want to compute the average of the values in column _2 and also count how many elements it contains, as shown below. What is the most efficient way to do this, given that the logic will be applied to millions of rows?

Output

_1   _2                 Average_2   Count_2
1    1,2,3,4            2.5         4
11   6,7,11,13          9.25        4
15   22                 22          1
20   5,10,21,22,28,32   19.66       6

CodePudding user response:

Split the string column, then use the array functions `size` and `aggregate`:

import org.apache.spark.sql.functions._

val df2 = df
  .withColumn("arr_2", split($"_2", ",").cast("array<double>")) // split and cast to numeric
  .withColumn("Count_2", size($"arr_2"))
  .withColumn(
    "Average_2",
    aggregate($"arr_2", lit(0.0), (s, x) => s + x, s => s / $"Count_2")
  )
  .drop("arr_2")
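As a quick sanity check on the per-row arithmetic, the same split/count/average logic can be sketched in plain Scala without Spark (`countAndAverage` is a hypothetical helper name, not part of any API):

```scala
// Per-row logic: split the comma-separated string, count the
// elements, and average them. Mirrors size + aggregate above.
def countAndAverage(s: String): (Int, Double) = {
  val nums = s.split(",").map(_.trim.toDouble)
  (nums.length, nums.sum / nums.length)
}

// Rows from the question's input column _2
val rows = Seq("1,2,3,4", "6,7,11,13", "22", "5,10,21,22,28,32")
rows.foreach { r =>
  val (count, avg) = countAndAverage(r)
  println(f"$r%-18s count=$count avg=$avg%.2f")
}
```

This reproduces the expected values from the question, e.g. "1,2,3,4" gives count 4 and average 2.5.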