Microsoft Spark aggregate method


I'm using the Microsoft.Spark .NET API and applying GroupBy to a DataFrame. After grouping, I would like to apply Agg to multiple columns.

In pyspark, I would express what I'm trying to accomplish with something like

  from pyspark.sql import functions as func

  new_df = (
      df.groupBy("customer_id")
        .agg(
            func.mean("a").alias("Mean"),
            func.stddev("a").alias("StDev"),
        )
  )
# ...

Using the .NET API, I've set up the DataFrame but do not understand how to use .Agg in an analogous way, e.g.:

var newData = dataFrame
   .GroupBy("customer_id")
   .Agg(dataFrame.Col("a")) // How to apply mean as the aggregate function to this column?
// ...

I see that the parameters for DataFrame.Agg are Column, Column[] (https://docs.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.dataframe.agg?view=spark-dotnet), but I'm not sure how to express that I want to group by one column name and then apply aggregate functions to other columns in the DataFrame.

CodePudding user response:

I found that the static class Functions in Microsoft.Spark.Sql contains standard DataFrame functions that return a Column object.

So perhaps something like the following is what I'm looking for:

      // Group by customer_id, then pass one aggregate per column to Agg.
      // Each Functions.* call returns a Column, which is what Agg expects.
      var newData = df
          .GroupBy("customer_id")
          .Agg(
              Functions.Count("col_a"),
              Functions.Max("col_b"),
              Functions.Sum("col_c")
          );
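
The same pattern should also cover the mean/standard-deviation case from the question. Below is a minimal sketch, assuming the Functions class exposes Mean and Stddev overloads that take a column name (mirroring org.apache.spark.sql.functions) and that Column.Alias names the result; the input path is hypothetical:

      using Microsoft.Spark.Sql;

      // Sketch: mean and standard deviation of column "a" per customer,
      // analogous to the pyspark snippet in the question.
      var spark = SparkSession.Builder().GetOrCreate();
      var df = spark.Read()
          .Option("header", "true")
          .Csv("customers.csv"); // hypothetical input file

      var newDf = df
          .GroupBy("customer_id")
          .Agg(
              Functions.Mean("a").Alias("Mean"),
              Functions.Stddev("a").Alias("StDev")
          );
      newDf.Show();

If your version of the library only exposes the Column overloads of these functions, wrapping the name with Functions.Col("a") should behave the same way.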