I am using PySpark and I want to get the first status ordered by date, but only one per run of consecutive identical statuses: a status can appear more than once overall, but when it appears several times in a row I only want to keep the first of those rows. Here is an example of what I have:
status | date |
---|---|
A | 2022-10-10 |
B | 2022-10-12 |
B | 2022-10-13 |
C | 2022-10-13 |
C | 2022-10-14 |
B | 2022-12-15 |
C | 2022-12-16 |
D | 2022-12-17 |
A | 2022-12-18 |
This is what I need:
status | date |
---|---|
A | 2022-10-10 |
B | 2022-10-12 |
C | 2022-10-13 |
B | 2022-12-15 |
C | 2022-12-16 |
D | 2022-12-17 |
A | 2022-12-18 |
I think I need something like the following, but I don't know how to implement it:
when(row[status] == row[status] 1) then row[status]
Thank you for your help.
CodePudding user response:
You can use a window function ordered by date (that's why the ordering of your dataframe matters) and compare each row's status with the previous row's status; you can do this with the lag function.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Order all rows by date so lag() returns the previous row's status
w = Window.orderBy('date')

df = (df
      .withColumn('status_lag', F.lag('status').over(w))
      # keep rows whose status differs from the previous one (the first row has no previous)
      .filter((F.col('status_lag') != F.col('status')) | (F.col('status_lag').isNull()))
      .drop('status_lag')
)
+------+----------+
|status|      date|
+------+----------+
|     A|2022-10-10|
|     B|2022-10-12|
|     C|2022-10-13|
|     B|2022-12-15|
|     C|2022-12-16|
|     D|2022-12-17|
|     A|2022-12-18|
+------+----------+
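For anyone who wants to reproduce this, here is a minimal sketch (the SparkSession and the inline data are only illustrative, rebuilt from the question's table) that creates the example DataFrame, so the snippet above can be run end to end:

from pyspark.sql import SparkSession

# Local session used only to rebuild the example data from the question
spark = SparkSession.builder.getOrCreate()

data = [
    ('A', '2022-10-10'), ('B', '2022-10-12'), ('B', '2022-10-13'),
    ('C', '2022-10-13'), ('C', '2022-10-14'), ('B', '2022-12-15'),
    ('C', '2022-12-16'), ('D', '2022-12-17'), ('A', '2022-12-18'),
]
df = spark.createDataFrame(data, ['status', 'date'])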
CodePudding user response:
I just found the answer to the problem. I realized I had to add another column to do the partitionBy.
# Partition by the grouping column so consecutive statuses are only compared within each group
w = Window.partitionBy('GP').orderBy('create_when')
df_1 = (df_0
        .withColumn('lag', F.lag('status').over(w))
        .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull())))
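A minimal sketch of the difference the partition makes, using hypothetical data with the GP and create_when columns named above (the values are made up for illustration); with partitionBy, the deduplication of consecutive statuses restarts inside each group:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: two groups, each containing a repeated consecutive status
df_0 = spark.createDataFrame(
    [('g1', 'A', '2022-10-10'), ('g1', 'A', '2022-10-11'), ('g1', 'B', '2022-10-12'),
     ('g2', 'A', '2022-10-10'), ('g2', 'B', '2022-10-11'), ('g2', 'B', '2022-10-12')],
    ['GP', 'status', 'create_when'])

w = Window.partitionBy('GP').orderBy('create_when')
df_1 = (df_0
        .withColumn('lag', F.lag('status').over(w))
        .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull())))
df_1.show()
# g1 keeps A (2022-10-10) and B (2022-10-12); g2 keeps A (2022-10-10) and B (2022-10-11)

Partitioning also keeps Spark from moving all rows into a single partition, which it warns about when a window definition has an ordering but no partitioning.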