I am faced with a rare problem where I need to attribute certain records by promixity of the date and remove the duplicates. My starting point is a join between vehicle sales table and parts sales table. After the join, the data looks like this:
|cust_key|vehicle_key|veh_purchase_date|parts_order_no|parts_purchase_date|parts_sales|parts_qty|
1 A 2019-01-01 1002 2022-02-01 100 2
1 B 2022-01-01 1002 2022-02-01 100 2
1 A 2019-01-01 1000 2021-02-01 50 2
1 B 2022-01-01 1000 2021-02-01 50 2
1 A 2019-01-01 1001 2021-12-31 200 1
1 B 2022-01-01 1001 2021-12-31 200 1
As you can see this customer purchased two different vehicles and they made three distinct orders of parts. I need to attribute the orders of parts to the vehicle purchases based on the date proximity. Parts should be attributed to the most recent vehicle sale as long as the parts order happened no earlier than 30 days before the vehicle purchase. The expected output is below. Here the order 1000 is correctly attributed to vehicle purchase A, since it happened after the purchase of vehicle A but <-30 days before the purchase of vehicle B. The order 1002 is after the purchase of vehicle B and finally the order 1001 happened 2 days before purchase of vehicle B, hence falls into the -30 day period and is attributed to vehicle B.
|cust_key|vehicle_key|veh_purchase_date|parts_order_no|parts_purchase_date|parts_sales|parts_qty|
1 A 2019-01-01 1000 2021-02-01 50 2
1 B 2022-01-01 1002 2022-02-01 100 2
1 B 2022-01-01 1001 2021-12-31 200 1
Final result is deduped with parts orders correctly attributed to the vehicle purchases. I have no clue how to do this in SQL. I thought potentially use ROW_NUMBER() to rank vehicle purchases and parts purchases by date and then filter the redundant ones, but I have no clue how to incorporate this -30 days rule.
CodePudding user response:
I think this should get you started. You can join this back to your original table to get the full data spread. You will probably need to modify this slightly based on edge cases not discussed.
SELECT MAX(Y.veh_purchase_date), Y.parts_order_no
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY X.parts_order_no ORDER BY X.parts_purchase_date, X.days_diff) AS row_num, x.*
FROM (
SELECT *, DATEDIFF(DAY, parts_purchase_date, veh_purchase_date) AS days_diff
FROM your_table_name_goes_here
) X
) Y
WHERE Y.days_diff <= 30
GROUP BY Y.parts_order_no