How to get next earliest date from another categorical value (within the same column) in PySpark?-CodePudding

I'm not sure my question title makes much sense so I shall explain with an example below.

I have a dataframe (in Databricks using PySpark) that looks like the below:

As of November 2020, table Markdown is now supported on all Stack Exchange sites.

Date	Category
2021-01-02	A
2021-01-06	A
2021-01-07	A
2021-01-10	B
2021-01-15	A
2021-01-16	A
2021-01-18	A
2021-01-23	B
2021-01-25	A

And I'm trying to get the date difference (in days) between category A and the next category B, example below:

Date	Category	Days Diff
2021-01-02	A	8
2021-01-06	A	4
2021-01-07	A	3
2021-01-10	B
2021-01-15	A	8
2021-01-16	A	7
2021-01-18	A	5
2021-01-23	B
2021-01-25	A	...

Does anyone have any ideas of the best way to do this? Thanks!

CodePudding user response：

You can use window function to achieve this.

First, create a column that contains only the Date when Category is B.

df = df.withColumn('BDate', when(df.Category == 'B', df.Date))

 ---------- -------- ---------- 
|      Date|Category|     BDate|
 ---------- -------- ---------- 
|2021-01-02|       A|      null|
|2021-01-06|       A|      null|
|2021-01-07|       A|      null|
|2021-01-10|       B|2021-01-10|
|2021-01-15|       A|      null|
...

Then, create a window that can look up any following rows from the current row and find the first BDate within the window. Take the date difference between the current Date and the BDate that is found.

w = Window.orderBy('Date').rowsBetween(0, Window.unboundedFollowing)
df = df.withColumn('Days Diff', F.datediff(F.first(df.BDate, ignorenulls=True).over(w), df.Date))

Result

 ---------- -------- ---------- --------- 
|      Date|Category|     BDate|Days Diff|
 ---------- -------- ---------- --------- 
|2021-01-02|       A|      null|        8|
|2021-01-06|       A|      null|        4|
|2021-01-07|       A|      null|        3|
|2021-01-10|       B|2021-01-10|        0|
|2021-01-15|       A|      null|        8|
|2021-01-16|       A|      null|        7|
|2021-01-18|       A|      null|        5|
|2021-01-23|       B|2021-01-23|        0|
|2021-01-25|       A|      null|     null|
 ---------- -------- ---------- ---------