I am using PySpark and I want to get the first status ordered by date, but only one per run of consecutive identical statuses: a status can appear more than once overall, but when it appears several times in a row I only want to keep the first of those rows. Here is an example of what I have:
status | date |
---|---|
A | 2022-10-10 |
B | 2022-10-12 |
B | 2022-10-13 |
C | 2022-10-13 |
C | 2022-10-14 |
B | 2022-12-15 |
C | 2022-12-16 |
D | 2022-12-17 |
A | 2022-12-18 |
This is what I need:
status | date |
---|---|
A | 2022-10-10 |
B | 2022-10-12 |
C | 2022-10-13 |
B | 2022-12-15 |
C | 2022-12-16 |
D | 2022-12-17 |
A | 2022-12-18 |
I think I need something like the following, but I don't know how to implement it:
when(row[status] == row[status] 1) then row[status]
Thank you for your help.
CodePudding user response:
You can use a window function ordered by date (that's why the ordering of your dataframe matters) and compare each row's status with the previous row's status; you can do this with the lag function.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Order all rows by date so lag() returns the previous row's status
w = Window.orderBy('date')

df = (df
      .withColumn('status_lag', F.lag('status').over(w))
      # keep rows whose status differs from the previous one (the first row has no previous)
      .filter((F.col('status_lag') != F.col('status')) | (F.col('status_lag').isNull()))
      .drop('status_lag')
)
+------+----------+
|status|      date|
+------+----------+
|     A|2022-10-10|
|     B|2022-10-12|
|     C|2022-10-13|
|     B|2022-12-15|
|     C|2022-12-16|
|     D|2022-12-17|
|     A|2022-12-18|
+------+----------+
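For anyone who wants to reproduce this, here is a minimal sketch (the SparkSession and the inline data are only illustrative, rebuilt from the question's table) that creates the example DataFrame, so the snippet above can be run end to end:

from pyspark.sql import SparkSession

# Local session used only to rebuild the example data from the question
spark = SparkSession.builder.getOrCreate()

data = [
    ('A', '2022-10-10'), ('B', '2022-10-12'), ('B', '2022-10-13'),
    ('C', '2022-10-13'), ('C', '2022-10-14'), ('B', '2022-12-15'),
    ('C', '2022-12-16'), ('D', '2022-12-17'), ('A', '2022-12-18'),
]
df = spark.createDataFrame(data, ['status', 'date'])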
CodePudding user response:
I just found the answer to the problem. I realized I had to add another column to do the partitionBy.
# Partition by the grouping column so consecutive statuses are only compared within each group
w = Window.partitionBy('GP').orderBy('create_when')
df_1 = (df_0
        .withColumn('lag', F.lag('status').over(w))
        .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull())))
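A minimal sketch of the difference the partition makes, using hypothetical data with the GP and create_when columns named above (the values are made up for illustration); with partitionBy, the deduplication of consecutive statuses restarts inside each group:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: two groups, each containing a repeated consecutive status
df_0 = spark.createDataFrame(
    [('g1', 'A', '2022-10-10'), ('g1', 'A', '2022-10-11'), ('g1', 'B', '2022-10-12'),
     ('g2', 'A', '2022-10-10'), ('g2', 'B', '2022-10-11'), ('g2', 'B', '2022-10-12')],
    ['GP', 'status', 'create_when'])

w = Window.partitionBy('GP').orderBy('create_when')
df_1 = (df_0
        .withColumn('lag', F.lag('status').over(w))
        .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull())))
df_1.show()
# g1 keeps A (2022-10-10) and B (2022-10-12); g2 keeps A (2022-10-10) and B (2022-10-11)

Partitioning also keeps Spark from moving all rows into a single partition, which it warns about when a window definition has an ordering but no partitioning.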