I need to update some columns based on unique ticket id(groupby) on following conditions and return the particular record:
1.Whenever the status is closed - that particular record run_date only needs to get update in closed_time column of the closed status record based on unique ticket id.
2.Whenever the status is In-progress - that particular record run_date only needs to get update in inprogress_time column of the closed status record based on unique ticket id.(Only run_date will get update in the inprogress_time of the closed status record).
3.Whenever the status is cancelled -that particular record run_date only needs to get update in cancelled_time column of the cancelled status record based on unique ticket id.
INPUT DATAFRAME
Id type inprogress_time closed_time cancelled_time status Source_system Run_date
11 TRUCK NAN NAN NAN Created LIBERATE 1/9/2021 12:00
11 TRUCK NAN NAN NAN In_Progress LIBERATE 1/9/2021 12:00
11 TRUCK NAN NAN NAN Closed LIBERATE 8/9/2021 19:21
22 TRUCK NAN NAN NAN Cancelled LIBERATE 3/9/2021 15:08
33 TRUCK NAN NAN NAN Created LIBERATE 4/10/2021 15:08
33 TRUCK NAN NAN NAN In_Progress LIBERATE 4/10/2021 15:08
33 TRUCK NAN NAN NAN Closed LIBERATE 5/10/2021 15:08
EXPECTED RESULT(OUTPUT DATAFRAME)
Id type inprogress_time closed_time cancelled_time status Source_system run_date
11 TRUCK 1/9/2021 12:00 8/9/2021 19:21 NAN Closed LIBERATE 8/9/2021 19:21
22 TRUCK NAN NAN 3/9/2021 15:08 Cancelled LIBERATE 3/9/2021 15:08
33 TRUCK 4/10/2021 15:08 5/10/2021 15:08 NAN Closed LIBERATE 5/10/2021 15:08
CodePudding user response:
I think pivot
is more efficient.
df.groupBy('Id', 'type', 'Source_system').pivot('status').agg(f.first('Run_date')) \
.withColumnRenamed('Cancelled', 'cancelled_time') \
.withColumnRenamed('Closed', 'closed_time') \
.withColumnRenamed('In_Progress', 'inprogress_time') \
.drop('Created') \
.withColumn('status', f.expr('''
CASE WHEN cancelled_time is not null THEN 'Cancelled'
WHEN closed_time is not null THEN 'Closed'
WHEN inprogress_time is not null THEN 'In_Progress'
ELSE 'Created' END ''')) \
.show(truncate=False)
--- ----- ------------- -------------- --------------- --------------- ---------
|Id |type |Source_system|cancelled_time|closed_time |inprogress_time|status |
--- ----- ------------- -------------- --------------- --------------- ---------
|33 |TRUCK|LIBERATE |null |5/10/2021 15:08|4/10/2021 15:08|Closed |
|11 |TRUCK|LIBERATE |null |8/9/2021 19:21 |1/9/2021 12:00 |Closed |
|22 |TRUCK|LIBERATE |3/9/2021 15:08|null |null |Cancelled|
--- ----- ------------- -------------- --------------- --------------- ---------