Merge Spark Dataframe rows based on key


Using pivot in PySpark I was able to get the values below. Note that the columns T1..T4 are generated dynamically by the pivot, so I cannot predict whether there will be more or fewer of them. (A sketch of the kind of pivot assumed is shown after the expected result.)

+--------------------+-----------+----------------+-------------+-------------+
|                  ID|         T1|              T2|           T3|           T4|
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|   0.000000|             0.0|        0E-10|23.1500000000|
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|        0E-10|
+--------------------+-----------+----------------+-------------+-------------+

Expected Result:

+--------------------+-----------+----------------+-------------+-------------+
|                  ID|         T1|              T2|           T3|           T4|
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|23.1500000000|
+--------------------+-----------+----------------+-------------+-------------+

Any help is appreciated!
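
For reference, a pivot that produces a frame of this shape might look like the sketch below; the long-format column names 'type' and 'value' and the sample rows are assumptions, not taken from my actual data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format input; 'type' and 'value' are assumed names.
long_df = spark.createDataFrame(
    [('15964021641455171213', 'T1', 55.56),
     ('15964021641455171213', 'T4', 23.15)],
    ['ID', 'type', 'value'])

# pivot('type') creates one output column per distinct type value
# (T1, T2, ...), which is why their number cannot be predicted upfront.
pivoted = long_df.groupBy('ID').pivot('type').agg(F.sum('value'))
pivoted.show()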

CodePudding user response:

The operation is a simple groupBy with sum as the aggregation function. The main issue here is that the names and number of columns to be summed are unknown, so the aggregation columns have to be computed dynamically:

from pyspark.sql import functions as F

df = ...

# Collect every column except the grouping key; these are the
# dynamically generated pivot columns (T1, T2, ...).
non_id_cols = df.columns
non_id_cols.remove('ID')

# Build one sum aggregation per value column, keeping the original name.
summed_non_id_cols = [F.sum(c).alias(c) for c in non_id_cols]

df.groupBy('ID').agg(*summed_non_id_cols).show()
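
A minimal, self-contained run of the same approach; the sample data below is reconstructed from the question (plain floats are used for brevity, while the original values appear to be decimals, but sum behaves the same either way):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two rows per ID, with the non-zero values spread across them,
# mirroring the pivot output shown in the question.
df = spark.createDataFrame(
    [('15964021641455171213', 0.0, 0.0, 0.0, 23.15),
     ('15964021641455171213', 55.56, 40.744, 18.52, 0.0)],
    ['ID', 'T1', 'T2', 'T3', 'T4'])

non_id_cols = [c for c in df.columns if c != 'ID']
df.groupBy('ID').agg(*[F.sum(c).alias(c) for c in non_id_cols]).show()

# This should print roughly:
# +--------------------+-----+------+-----+-----+
# |                  ID|   T1|    T2|   T3|   T4|
# +--------------------+-----+------+-----+-----+
# |15964021641455171213|55.56|40.744|18.52|23.15|
# +--------------------+-----+------+-----+-----+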