I have an `org.apache.spark.sql.DataFrame = [_1: string, _2: string]` with a column containing values in the format below.
Input

| _1 | _2 |
|----|----|
| 1 | 1,2,3,4 |
| 11 | 6,7,11,13 |
| 15 | 22 |
| 20 | 5,10,21,22,28,32 |
I want to compute the average of the values in column `_2`, and also count the number of elements in it, as shown below. What is the most efficient way to do this, given that the logic will be applied to millions of rows?
Output

| _1 | _2 | Average_2 | Count_2 |
|----|----|-----------|---------|
| 1 | 1,2,3,4 | 2.5 | 4 |
| 11 | 6,7,11,13 | 9.25 | 4 |
| 15 | 22 | 22 | 1 |
| 20 | 5,10,21,22,28,32 | 19.66 | 6 |
CodePudding user response:
Split the string column, then use the array functions `size` and `aggregate` (the latter is available as a DataFrame function since Spark 3.0):
```scala
import org.apache.spark.sql.functions._

val df2 = df
  .withColumn("arr_2", split($"_2", ","))          // "1,2,3,4" -> ["1","2","3","4"]
  .withColumn("Count_2", size($"arr_2"))           // number of elements
  .withColumn(
    "Average_2",
    // fold the array: sum the elements (strings are implicitly cast to double,
    // hence the 0.0 accumulator), then divide by the count
    aggregate($"arr_2", lit(0.0), (s, x) => s + x, s => s / $"Count_2")
  )
  .drop("arr_2")
```
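For intuition, the per-row computation that the `aggregate` expression performs can be sketched in plain Scala (this is a hypothetical helper for illustration only, not part of the Spark job; in the job itself the work stays inside Catalyst expressions, which is why this approach is faster than a UDF):

```scala
// Plain-Scala sketch of the per-row logic: split the comma-separated
// string, sum the parsed values, and divide by the element count.
def averageAndCount(s: String): (Double, Int) = {
  val parts = s.split(",").map(_.trim.toDouble)
  (parts.sum / parts.length, parts.length)
}

// e.g. averageAndCount("1,2,3,4") yields (2.5, 4)
```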