I have a DataFrame with hundreds of feature columns, like this:
Country | ID  | Feature_1 | Feature_2 | Feature_3 | ...
US      | 123 | 1         | 5         | 0         |
US      | 456 | 0         | 10        | 1         |
CA      | 789 | 0         | 6         | 1         |
CA      | 999 | 0         | 3         | 0         |
...
I want to perform a group by on Country, then take the sum per feature, so I should end up with something like this:
Country | Feature_1 | Feature_2 | Feature_3 | ...
US      | 1         | 15        | 1         |
CA      | 0         | 9         | 1         |
How can I efficiently compute the aggregated sum for all of my hundreds of features? I know that for a single feature it looks like this:
df.groupBy("Country").sum("Feature_1")
CodePudding user response:
You can try:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF on a local Seq

val df = Seq(
  ("a", 9, 1),
  ("a", 4, 2),
  ("b", 1, 3),
  ("a", 1, 4),
  ("b", 2, 5)
).toDF("name", "x", "y")

df.groupBy(col("name"))
  .agg(
    sum(col("x")).as("xsum"),
    sum(col("y")).as("ysum")
  )
  .show(false)
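On that sample data the output should look roughly like this (row order is not guaranteed):
+----+----+----+
|name|xsum|ysum|
+----+----+----+
|a   |14  |7   |
|b   |3   |8   |
+----+----+----+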
If you want to make it dynamic:
import org.apache.spark.sql.Column

// Build one sum expression per column name, then pass them all to agg.
var exprs: List[Column] = List()
for (c <- List[String]("x", "y")) {
  exprs :+= expr(s"sum($c) as sum_$c")
}

df.groupBy(col("name"))
  .agg(exprs.head, exprs.tail: _*)
  .show(false)
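Applied to the original question, the same pattern scales to hundreds of columns if you build the column list from df.columns instead of typing it out. A minimal sketch, assuming df is the original DataFrame and every column other than Country and ID is a feature to sum:

import org.apache.spark.sql.functions.{col, sum}

// Assumption: all columns except the grouping key and the ID are features.
val featureCols = df.columns.filterNot(Seq("Country", "ID").contains(_))

// One sum(...) expression per feature column.
val sumExprs = featureCols.map(c => sum(col(c)).as(s"sum_$c"))

df.groupBy("Country")
  .agg(sumExprs.head, sumExprs.tail: _*)
  .show(false)

This avoids listing hundreds of Feature_N names by hand; only the columns to exclude have to be spelled out.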