I'm not sure my question title makes much sense so I shall explain with an example below.
I have a dataframe (in Databricks using PySpark) that looks like the below:
As of November 2020, table Markdown is now supported on all Stack Exchange sites.
Date | Category |
---|---|
2021-01-02 | A |
2021-01-06 | A |
2021-01-07 | A |
2021-01-10 | B |
2021-01-15 | A |
2021-01-16 | A |
2021-01-18 | A |
2021-01-23 | B |
2021-01-25 | A |
And I'm trying to get the date difference (in days) between category A and the next category B, example below:
Date | Category | Days Diff |
---|---|---|
2021-01-02 | A | 8 |
2021-01-06 | A | 4 |
2021-01-07 | A | 3 |
2021-01-10 | B | |
2021-01-15 | A | 8 |
2021-01-16 | A | 7 |
2021-01-18 | A | 5 |
2021-01-23 | B | |
2021-01-25 | A | ... |
Does anyone have any ideas of the best way to do this? Thanks!
CodePudding user response:
You can use window function to achieve this.
First, create a column that contains only the Date when Category is B.
df = df.withColumn('BDate', when(df.Category == 'B', df.Date))
---------- -------- ----------
| Date|Category| BDate|
---------- -------- ----------
|2021-01-02| A| null|
|2021-01-06| A| null|
|2021-01-07| A| null|
|2021-01-10| B|2021-01-10|
|2021-01-15| A| null|
...
Then, create a window that can look up any following rows from the current row and find the first BDate
within the window. Take the date difference between the current Date
and the BDate
that is found.
w = Window.orderBy('Date').rowsBetween(0, Window.unboundedFollowing)
df = df.withColumn('Days Diff', F.datediff(F.first(df.BDate, ignorenulls=True).over(w), df.Date))
Result
---------- -------- ---------- ---------
| Date|Category| BDate|Days Diff|
---------- -------- ---------- ---------
|2021-01-02| A| null| 8|
|2021-01-06| A| null| 4|
|2021-01-07| A| null| 3|
|2021-01-10| B|2021-01-10| 0|
|2021-01-15| A| null| 8|
|2021-01-16| A| null| 7|
|2021-01-18| A| null| 5|
|2021-01-23| B|2021-01-23| 0|
|2021-01-25| A| null| null|
---------- -------- ---------- ---------