I have a dataframe having the following columns
'CUSTOMER_ID','ORDERED_TIME','ORDER_ID'
Here, ORDERED_TIME
is a timestamp field indicating when the order was placed, CUSTOMER_ID
is the customer id (INTEGER) and ORDER_ID
is the order_id (INTEGER)
Now, from this dataframe, for every order made in the month of Jan to March 2021, I need to find out the time the previous order was placed just before ORDERED_TIME
. Let's call this column PREV_ORDER_TIME
The dataframe includes order data all the way back to 2019. However, I am only computing PREV_ORDER_TIME
for all orders made between Jan to March. If there is no PREV_ORDER_TIME
order for an order, I would like to display null
My final expected output is something like ['CUSTOMER_ID','ORDERED_TIME','ORDER_ID','PREV_ORDER_TIME']
The way I understand, for some orders in these 3 months, the previous orders can be in these 3 months, but for some other orders, it may well be before this window.
I am not sure how to approach this. I feel like I might need to use some binary search to find the closest date, but am not really sure how to make this work only for orders from Jan to March despite my dataset containing orders well before that.
Please let me know if any other information is needed
Any help would be appreciated. Thanks
CodePudding user response:
You can use lag
function over a window, as follow:
from pyspark.sql import Window
from pyspark.sql import functions as F
window = Window().partitionBy("CUSTOMER_ID").orderBy('ORDERED_TIME')
dataframe.withColumn('PREV_ORDER_TIME', F.lag('ORDERED_TIME').over(window))
If there is no previous order, column PREV_ORDER_TIME
will be null
And if you want to compute this PREV_ORDER_TIME
column only for date between 2021-01-01 and 2021-03-31, you can add a when
clause, as follow:
from datetime import datetime
from pyspark.sql import Window
from pyspark.sql import functions as F
window = Window().partitionBy("CUSTOMER_ID").orderBy('ORDERED_TIME')
dataframe.withColumn(
'PREV_ORDER_TIME',
F.when(
(F.col('ORDERED_TIME') < datetime(2021, 4, 1)) & (F.col('ORDERED_TIME') >= datetime(2021, 1, 1)),
F.lag('ORDERED_TIME').over(window)
)
)
For all orders when date is not between 2021-01-01 and 2021-03-31, column PREV_ORDER_TIME
will be null
So if you have the following input dataframe:
----------- -------- -------------------
|CUSTOMER_ID|ORDER_ID|ORDERED_TIME |
----------- -------- -------------------
|1 |21 |2019-12-15 10:20:30|
|1 |22 |2020-12-16 11:21:31|
|1 |23 |2021-01-17 12:22:32|
|2 |24 |2021-01-18 13:23:33|
|2 |25 |2021-02-19 14:24:34|
|2 |26 |2021-03-20 15:25:35|
----------- -------- -------------------
You will get the following output:
----------- -------- ------------------- -------------------
|CUSTOMER_ID|ORDER_ID|ORDERED_TIME |PREV_ORDER_TIME |
----------- -------- ------------------- -------------------
|1 |21 |2019-12-15 10:20:30|null |
|1 |22 |2020-12-16 11:21:31|null |
|1 |23 |2021-01-17 12:22:32|2020-12-16 11:21:31|
|2 |24 |2021-01-18 13:23:33|null |
|2 |25 |2021-02-19 14:24:34|2021-01-18 13:23:33|
|2 |26 |2021-03-20 15:25:35|2021-02-19 14:24:34|
----------- -------- ------------------- -------------------