PySpark: how to group or collapse dataframe rows based on a common column

I have a PySpark dataframe similar to:

  company   date         value   category
  ---------------------------------------
  xyz       31-12-2020   12
  xyz                            SAF
  abc                    11
  abc       30-06-2020           AF
  jfk                            SAF
  jfk       30-09-2020   13

I'm trying to group it in the following way:

  company   date         value   category
  ---------------------------------------
  xyz       31-12-2020   12      SAF
  abc       30-06-2020   11      AF
  jfk       30-09-2020   13      SAF

I already tried with:

df = df.groupBy("company",
                 "date",
                 "value",
                 "category").max()

But the result is not what I expected. I'm not looking to sum or aggregate any of the fields, just trying to "collapse" the rows according to the company column.

CodePudding user response:

Assuming the missing values in your DataFrame are null, you can still use max(), since max() ignores nulls; just change the call to:

from pyspark.sql import functions as func

df.groupBy("company").agg(
    func.max("date").alias("date"),
    func.max("value").alias("value"),
    func.max("category").alias("category")
).show(10)
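
If you'd rather not rely on max() over string columns, first() with ignorenulls=True collapses the rows the same way here, because each group holds at most one non-null value per column. A minimal, self-contained sketch assuming the sample data from the question (the Spark session and variable names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# Rebuild the sample data; None marks the empty cells from the question
df = spark.createDataFrame(
    [
        ("xyz", "31-12-2020", 12,   None),
        ("xyz", None,         None, "SAF"),
        ("abc", None,         11,   None),
        ("abc", "30-06-2020", None, "AF"),
        ("jfk", None,         None, "SAF"),
        ("jfk", "30-09-2020", 13,   None),
    ],
    ["company", "date", "value", "category"],
)

# first(..., ignorenulls=True) picks the single non-null value in each group
collapsed = df.groupBy("company").agg(
    func.first("date", ignorenulls=True).alias("date"),
    func.first("value", ignorenulls=True).alias("value"),
    func.first("category", ignorenulls=True).alias("category"),
)
collapsed.show()

Note that first() is order-dependent in general; it behaves deterministically here only because every group has exactly one non-null value per column, which is also why max() works.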