I'm using the Microsoft.Spark (.NET for Apache Spark) API and applying GroupBy to a DataFrame object. After grouping, I would like to apply Agg to multiple columns.
In pyspark, I would express what I'm trying to accomplish with something like:
import pyspark.sql.functions as func

new_df = df.groupBy("customer_id").agg(
    func.mean("a").alias("Mean"),
    func.stddev("a").alias("StDev")
)
# ...
Using the .NET API, I've set up the DataFrame but do not understand how to use .Agg
in an analogous way, e.g.:
var newData = dataFrame
    .GroupBy("customer_id")
    .Agg(dataFrame.Col("a")) // How to apply mean as the aggregate function to this column?
// ...
I see that the parameters for DataFrame.Agg are (Column, params Column[]) (https://docs.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.dataframe.agg?view=spark-dotnet), but I'm not sure how to express that I want to GroupBy one column and apply aggregate functions to other columns in the DataFrame.
CodePudding user response:
I found that the static class Functions
in Microsoft.Spark.Sql
contains standard DataFrame functions that return a Column
object.
So perhaps something like the following is what I'm looking for:
using Microsoft.Spark.Sql;

var newData = df
    .GroupBy("customer_id")
    .Agg(
        Functions.Count("col_a"), // count of col_a per group
        Functions.Max("col_b"),   // max of col_b per group
        Functions.Sum("col_c")    // sum of col_c per group
    );
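For the mean/standard-deviation aggregation from the question, the same pattern should carry over. A minimal sketch, assuming Functions.Mean and Functions.Stddev (the Microsoft.Spark.Sql counterparts of pyspark's mean/stddev) and Column.Alias for naming the output columns:
using Microsoft.Spark.Sql;

// Group by customer_id, then compute the mean and sample standard
// deviation of column "a", aliased to match the pyspark example.
var newDf = df
    .GroupBy("customer_id")
    .Agg(
        Functions.Mean("a").Alias("Mean"),
        Functions.Stddev("a").Alias("StDev")
    );
Since each aggregate is just a Column, any mix of Functions.* calls (with optional Alias renames) fits the Agg(Column, params Column[]) signature noted in the question.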