Home > Blockchain >  Spark PySpark Increasing value lag/last function with window
Spark PySpark Increasing value lag/last function with window

Time:10-08

I have a dataframe like this (pyspark)

customer date col_x day_number
1 5/10/2022 ...(val doesn't matter) NULL
1 5/11/2022 ... NULL
1 5/12/2022 ... 0
1 5/13/2022 ... NULL
1 5/14/2022 ... NULL
1 5/15/2022 ... 3
1 5/16/2022 ... 4
1 5/17/2022 ... NULL
1 5/18/2022 ... 6
2 5/10/2022 ...(val doesn't matter) NULL
2 5/11/2022 ... NULL
2 5/12/2022 ... 0
2 5/13/2022 ... NULL
2 5/14/2022 ... 2
2 5/15/2022 ... 3
2 5/16/2022 ... NULL
2 5/17/2022 ... 5
2 5/18/2022 ... NULL

and this pattern repeats for a number of customer.

What I would like to have is to fill the number between (for example for customer 1 on 5/13 and 5/14 the day number would be 1 and 2) and it should always be increasing by 1 since EVERY date value is populated.

I have tried a lag function and I believe this would work (code below) but I am wondering if there is any other way to do this, perhaps with row_number() starting from the first non-null day_number value for each grouping.

This is what I have tried

import pyspark.sql.functions as F

window = Window.partitionBy("customer").orderBy("date")


#LAG FUNCTION SEEMS TO WORK
df.select(
F.coalesce(F.col("day_number"), F.lit(F.lag(F.col("day_number")).over(window))   1)
)
#SOME ROW_NUMBER() FUNCTION , DOES NOT WORK -- 
df.select(
F.coalesce(F.col("day_number"), F.row_number().over(window)   F.lit(F.min("day_number").over(window) - 1)).alias("day_number")
)

Would be curious if/what the other solutions are out there, I also was thinking there is a way to use F.last(), but there would need to be a way to add the offset ( 1, 2, 3) etc...

Thanks!

CodePudding user response:

one solution can definitely be done this way:

window = Window.partitionBy("customer").orderBy("date")

df.select(F.when(F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))>=0,F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))).alias("day_number_test")

but it doesn't seem like the prettiest solution, I'd be curious to know if any others exist which incorporate into a window function right away.

  • Related