I have a PySpark dataframe similar to:
company   date         value   category
----------------------------------------
xyz       31-12-2020   12
xyz                            SAF
abc                    11
abc       30-06-2020           AF
jfk                            SAF
jfk       30-09-2020   13
I'm trying to group it in the following way:
company   date         value   category
----------------------------------------
xyz       31-12-2020   12      SAF
abc       30-06-2020   11      AF
jfk       30-09-2020   13      SAF
I already tried with:
df = df.groupBy("company",
"date",
"value",
"category").max()
But the result is not what I expected. I'm not looking to sum or aggregate any of the fields; I just want to "collapse" the rows for each company into a single row.
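For reference, here is a minimal sketch that reproduces the sample dataframe, assuming the blank cells are nulls (the explicit schema string is just one way to set it up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Blank cells from the table above are represented as None (i.e. null)
data = [
    ("xyz", "31-12-2020", 12,   None),
    ("xyz", None,         None, "SAF"),
    ("abc", None,         11,   None),
    ("abc", "30-06-2020", None, "AF"),
    ("jfk", None,         None, "SAF"),
    ("jfk", "30-09-2020", 13,   None),
]
df = spark.createDataFrame(
    data, "company string, date string, value int, category string"
)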
CodePudding user response:
Assuming the missing values in your dataframe are null, you can still use max(): it ignores nulls, so for each company it picks up the single non-null value in each column. The issue with your attempt is that grouping by all four columns makes every distinct row its own group, so nothing collapses. Group by "company" alone and aggregate the rest:
df.groupby("company").agg(func.max("date").alias("date"), func.max("value").alias("value"), func.max("category").alias("category")).show(10)