PYSPARK : Update value in a row with another row value based on condition?-CodePudding

I need to update some columns based on unique ticket id(groupby) on following conditions and return the particular record:

1.Whenever the status is closed - that particular record run_date only needs to get update in closed_time column of the closed status record based on unique ticket id.

2.Whenever the status is In-progress - that particular record run_date only needs to get update in inprogress_time column of the closed status record based on unique ticket id.(Only run_date will get update in the inprogress_time of the closed status record).

3.Whenever the status is cancelled -that particular record run_date only needs to get update in cancelled_time column of the cancelled status record based on unique ticket id.

INPUT DATAFRAME

Id  type  inprogress_time  closed_time  cancelled_time  status    Source_system   Run_date
11  TRUCK   NAN            NAN           NAN            Created     LIBERATE    1/9/2021 12:00
11  TRUCK   NAN            NAN           NAN            In_Progress LIBERATE    1/9/2021 12:00
11  TRUCK   NAN            NAN           NAN            Closed      LIBERATE    8/9/2021 19:21
22  TRUCK   NAN            NAN           NAN            Cancelled   LIBERATE    3/9/2021 15:08
33  TRUCK   NAN            NAN           NAN            Created     LIBERATE    4/10/2021 15:08
33  TRUCK   NAN            NAN           NAN            In_Progress LIBERATE    4/10/2021 15:08
33  TRUCK   NAN            NAN           NAN            Closed      LIBERATE    5/10/2021 15:08

EXPECTED RESULT(OUTPUT DATAFRAME)

Id  type    inprogress_time  closed_time    cancelled_time     status       Source_system  run_date
11  TRUCK   1/9/2021 12:00  8/9/2021 19:21   NAN               Closed        LIBERATE  8/9/2021 19:21
22  TRUCK   NAN             NAN              3/9/2021 15:08    Cancelled     LIBERATE  3/9/2021 15:08                 
33  TRUCK   4/10/2021 15:08 5/10/2021 15:08  NAN               Closed        LIBERATE  5/10/2021 15:08

CodePudding user response：

I think pivot is more efficient.

df.groupBy('Id', 'type', 'Source_system').pivot('status').agg(f.first('Run_date')) \
  .withColumnRenamed('Cancelled',   'cancelled_time') \
  .withColumnRenamed('Closed',      'closed_time') \
  .withColumnRenamed('In_Progress', 'inprogress_time') \
  .drop('Created') \
  .withColumn('status', f.expr('''
      CASE WHEN cancelled_time  is not null THEN 'Cancelled'
           WHEN closed_time     is not null THEN 'Closed'
           WHEN inprogress_time is not null THEN 'In_Progress'
      ELSE 'Created' END ''')) \
  .show(truncate=False)

 --- ----- ------------- -------------- --------------- --------------- --------- 
|Id |type |Source_system|cancelled_time|closed_time    |inprogress_time|status   |
 --- ----- ------------- -------------- --------------- --------------- --------- 
|33 |TRUCK|LIBERATE     |null          |5/10/2021 15:08|4/10/2021 15:08|Closed   |
|11 |TRUCK|LIBERATE     |null          |8/9/2021 19:21 |1/9/2021 12:00 |Closed   |
|22 |TRUCK|LIBERATE     |3/9/2021 15:08|null           |null           |Cancelled|
 --- ----- ------------- -------------- --------------- --------------- ---------