Find out the closest value in PySpark


I have a dataframe with the following columns:

'CUSTOMER_ID','ORDERED_TIME','ORDER_ID'

Here, ORDERED_TIME is a timestamp field indicating when the order was placed, CUSTOMER_ID is the customer ID (INTEGER), and ORDER_ID is the order ID (INTEGER).

Now, from this dataframe, for every order placed between January and March 2021, I need to find the time at which that customer's previous order was placed, i.e. the most recent order before ORDERED_TIME. Let's call this column PREV_ORDER_TIME.

The dataframe includes order data going back to 2019. However, I only need to compute PREV_ORDER_TIME for orders placed between January and March 2021. If an order has no previous order, I would like PREV_ORDER_TIME to be null.

My final expected output is something like ['CUSTOMER_ID','ORDERED_TIME','ORDER_ID','PREV_ORDER_TIME']

The way I understand it, for some orders in these three months the previous order also falls within these three months, but for other orders it may well be before this window.

I am not sure how to approach this. I feel like I might need some kind of binary search to find the closest date, but I am not really sure how to make this work only for orders from January to March when my dataset contains orders well before that.

Please let me know if any other information is needed

Any help would be appreciated. Thanks

CodePudding user response:

You can use the lag function over a window, as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

# One window per customer, with rows ordered by order time
window = Window.partitionBy("CUSTOMER_ID").orderBy("ORDERED_TIME")

# lag() returns the ORDERED_TIME of the previous row in the window
dataframe.withColumn("PREV_ORDER_TIME", F.lag("ORDERED_TIME").over(window))

If there is no previous order, the PREV_ORDER_TIME column will be null.
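
For example, you can materialize the result and inspect it (a quick sketch, assuming dataframe already holds your orders with ORDERED_TIME as a timestamp column):

result = dataframe.withColumn("PREV_ORDER_TIME", F.lag("ORDERED_TIME").over(window))

# The first order of each customer has no predecessor, so its PREV_ORDER_TIME is null
result.orderBy("CUSTOMER_ID", "ORDERED_TIME").show(truncate=False)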

And if you want to compute the PREV_ORDER_TIME column only for dates between 2021-01-01 and 2021-03-31, you can add a when clause, as follows:

from datetime import datetime

from pyspark.sql import Window
from pyspark.sql import functions as F

window = Window.partitionBy("CUSTOMER_ID").orderBy("ORDERED_TIME")

# Only fill PREV_ORDER_TIME for orders placed from 2021-01-01 up to (but not
# including) 2021-04-01; when() without otherwise() leaves every other row null
dataframe.withColumn(
    "PREV_ORDER_TIME",
    F.when(
        (F.col("ORDERED_TIME") >= datetime(2021, 1, 1)) & (F.col("ORDERED_TIME") < datetime(2021, 4, 1)),
        F.lag("ORDERED_TIME").over(window)
    )
)

For all orders whose date is not between 2021-01-01 and 2021-03-31, the PREV_ORDER_TIME column will be null.
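
Note that the when clause only controls which rows get a value; the pre-2021 orders themselves still appear in the result. If you also want the output restricted to the January-March 2021 orders, one option (a sketch of an alternative, not part of the original answer) is to compute the lag over the full history first and then filter:

result = (
    dataframe
    .withColumn("PREV_ORDER_TIME", F.lag("ORDERED_TIME").over(window))
    # The lag is computed over the full history, so PREV_ORDER_TIME can point
    # to an order placed before January 2021; only the current order is filtered
    .filter(
        (F.col("ORDERED_TIME") >= F.lit("2021-01-01").cast("timestamp"))
        & (F.col("ORDERED_TIME") < F.lit("2021-04-01").cast("timestamp"))
    )
)

The example below continues with the when() version from the answer.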

So if you have the following input dataframe:

+-----------+--------+-------------------+
|CUSTOMER_ID|ORDER_ID|ORDERED_TIME       |
+-----------+--------+-------------------+
|1          |21      |2019-12-15 10:20:30|
|1          |22      |2020-12-16 11:21:31|
|1          |23      |2021-01-17 12:22:32|
|2          |24      |2021-01-18 13:23:33|
|2          |25      |2021-02-19 14:24:34|
|2          |26      |2021-03-20 15:25:35|
+-----------+--------+-------------------+

You will get the following output:

+-----------+--------+-------------------+-------------------+
|CUSTOMER_ID|ORDER_ID|ORDERED_TIME       |PREV_ORDER_TIME    |
+-----------+--------+-------------------+-------------------+
|1          |21      |2019-12-15 10:20:30|null               |
|1          |22      |2020-12-16 11:21:31|null               |
|1          |23      |2021-01-17 12:22:32|2020-12-16 11:21:31|
|2          |24      |2021-01-18 13:23:33|null               |
|2          |25      |2021-02-19 14:24:34|2021-01-18 13:23:33|
|2          |26      |2021-03-20 15:25:35|2021-02-19 14:24:34|
+-----------+--------+-------------------+-------------------+
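
For reference, the sample input above can be built with something like this (a sketch; spark is assumed to be an existing SparkSession):

dataframe = spark.createDataFrame(
    [
        (1, 21, "2019-12-15 10:20:30"),
        (1, 22, "2020-12-16 11:21:31"),
        (1, 23, "2021-01-17 12:22:32"),
        (2, 24, "2021-01-18 13:23:33"),
        (2, 25, "2021-02-19 14:24:34"),
        (2, 26, "2021-03-20 15:25:35"),
    ],
    ["CUSTOMER_ID", "ORDER_ID", "ORDERED_TIME"],
# Cast the timestamp strings to an actual timestamp column so the window ordering is chronological
).withColumn("ORDERED_TIME", F.to_timestamp("ORDERED_TIME"))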