Get first value from consecutive rows PySpark

I am using PySpark and I want to keep only the first status of each consecutive run, ordered by date. A status can appear more than once overall, but not more than once in a row. Here is an example of what I have:

status date
A 2022-10-10
B 2022-10-12
B 2022-10-13
C 2022-10-13
C 2022-10-14
B 2022-12-15
C 2022-12-16
D 2022-12-17
A 2022-12-18

This is what I need:

status date
A 2022-10-10
B 2022-10-12
C 2022-10-13
B 2022-12-15
C 2022-12-16
D 2022-12-17
A 2022-12-18

I think I need something like the following, but I don't know how to implement it:

when(row[status] == row[status] + 1) then row[status]

Thank you for your help.

CodePudding user response:

You can use a window function ordered by date - that's why the order of your dataframe matters - and compare each row's status with the previous row's status, which you can get with the lag function.
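
For reference, here is a minimal way to build the sample dataframe from the question (this assumes an active SparkSession named spark; it isn't part of the original answer). The dates are kept as ISO strings, which sort correctly when ordered:

df = spark.createDataFrame(
    [('A', '2022-10-10'), ('B', '2022-10-12'), ('B', '2022-10-13'),
     ('C', '2022-10-13'), ('C', '2022-10-14'), ('B', '2022-12-15'),
     ('C', '2022-12-16'), ('D', '2022-12-17'), ('A', '2022-12-18')],
    ['status', 'date'])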

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy('date')

df = (df
 .withColumn('status_lag', F.lag('status').over(w))  # previous row's status (null on the first row)
 .filter((F.col('status_lag') != F.col('status')) | (F.col('status_lag').isNull()))  # keep only the first row of each run
 .drop('status_lag')
)

+------+----------+
|status|      date|
+------+----------+
|     A|2022-10-10|
|     B|2022-10-12|
|     C|2022-10-13|
|     B|2022-12-15|
|     C|2022-12-16|
|     D|2022-12-17|
|     A|2022-12-18|
+------+----------+

CodePudding user response:

I just found the answer to the problem. I realized I had to add another column to do the partitionBy.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('GP').orderBy('create_when')  # 'GP' is the grouping column
df_1 = df_0.withColumn('lag', F.lag('status').over(w))\
           .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull()))
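
A side note on why this helps: without partitionBy, Spark has to move every row into a single partition to evaluate the window (it logs a warning about this), whereas partitioning by a grouping column such as 'GP' keeps the lag comparison inside each group and lets the work stay distributed.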