I have a dataframe like this (pyspark)
customer | date | col_x | day_number |
---|---|---|---|
1 | 5/10/2022 | ...(val doesn't matter) | NULL |
1 | 5/11/2022 | ... | NULL |
1 | 5/12/2022 | ... | 0 |
1 | 5/13/2022 | ... | NULL |
1 | 5/14/2022 | ... | NULL |
1 | 5/15/2022 | ... | 3 |
1 | 5/16/2022 | ... | 4 |
1 | 5/17/2022 | ... | NULL |
1 | 5/18/2022 | ... | 6 |
2 | 5/10/2022 | ...(val doesn't matter) | NULL |
2 | 5/11/2022 | ... | NULL |
2 | 5/12/2022 | ... | 0 |
2 | 5/13/2022 | ... | NULL |
2 | 5/14/2022 | ... | 2 |
2 | 5/15/2022 | ... | 3 |
2 | 5/16/2022 | ... | NULL |
2 | 5/17/2022 | ... | 5 |
2 | 5/18/2022 | ... | NULL |
and this pattern repeats for a number of customer.
What I would like to have is to fill the number between (for example for customer 1 on 5/13 and 5/14 the day number would be 1 and 2) and it should always be increasing by 1 since EVERY date value is populated.
I have tried a lag function and I believe this would work (code below) but I am wondering if there is any other way to do this, perhaps with row_number()
starting from the first non-null day_number
value for each grouping.
This is what I have tried
import pyspark.sql.functions as F
window = Window.partitionBy("customer").orderBy("date")
#LAG FUNCTION SEEMS TO WORK
df.select(
F.coalesce(F.col("day_number"), F.lit(F.lag(F.col("day_number")).over(window)) 1)
)
#SOME ROW_NUMBER() FUNCTION , DOES NOT WORK --
df.select(
F.coalesce(F.col("day_number"), F.row_number().over(window) F.lit(F.min("day_number").over(window) - 1)).alias("day_number")
)
Would be curious if/what the other solutions are out there, I also was thinking there is a way to use F.last()
, but there would need to be a way to add the offset ( 1, 2, 3) etc...
Thanks!
CodePudding user response:
one solution can definitely be done this way:
window = Window.partitionBy("customer").orderBy("date")
df.select(F.when(F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))>=0,F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))).alias("day_number_test")
but it doesn't seem like the prettiest solution, I'd be curious to know if any others exist which incorporate into a window function right away.