I have two tables, one table is a list of purchases with revenue, purchase_time and a user id, the other table has a list of a list of campaign clicks with campaign_id, user_id, click_time. campaign_clicks essentially logs all clicks from a campaign, there can any number of clicks or none and these could happen anytime, before or after a purchase but what I need to do is determine which campaign_id was the last campaign to be clicked on before the purchase was made by any given user and what was the total revenue attributed to that campaign_id. I want to only attribute revenue to clicks that occurred within 3 days prior to the purchase.
purchases
date | user_id | revenue | purchase_time |
---|---|---|---|
2020/09/01 | 10 | 30.0 | 2020/09/01 10:10:00 am |
2020/09/01 | 20 | 15.0 | 2020/09/02 09:15:00 am |
2020/09/01 | 30 | 25.0 | 2020/09/02 08:15:00 am |
campaign_clicks
user_id | campaign_id | click_time |
---|---|---|
10 | 2 | 2020/09/01 10:01:00 am |
10 | 1 | 2020/09/01 10:05:00 am |
10 | 2 | 2020/09/01 10:20:00 am |
20 | 2 | 2020/09/01 10:10:00 am |
30 | 2 | 2020/09/01 07:30:00 am |
desired result
date | campaign_id | revenue |
---|---|---|
2020/09/01 | 1 | 30.0 |
2020/09/01 | 2 | 25.0 |
purchase from user id 20 shouldn't be included because it occurred before the click_time. User 10 revenue should be attributed to campaign 2 because the click occurred just before the purchase.
My problem is the join I have is returning all the clicks which is inflating the revenue. The select in the inner join isn’t aware of the purchase time, I need to somehow filter and narrow down the clicks to a single click, the last click. I've tried using ROW_NUMBER() to apply an index but that doesn't allow me to filter out clicks that occur after the purchase.
This is where I’m at
SELECT
date
,ROUND(sum(revenue)) as revenue
,campaign_clicks.campaign_id
FROM
purchases
LEFT JOIN (
SELECT
campaign_id
,user_id
,click_time
FROM
campaign_clicks
ORDER BY
click_time DESC
) AS clicks ON clicks.user_id = purchases.user_id
WHERE
-- only select campaign clicks that occurred before the purchase
purchases.purchase_time > clicks.click_time
-- only include clicks that occurred within 3 days of the purchase
AND DATEDIFF(minutes, clicks.click_time,purchases.purchase_time) <= (60*24*3)
-- PROBLEM HERE - there can be still a number of other clicks that occurred before the purchase I need to filter to only the last one
GROUP BY
date
,clicks.campaign_id
CodePudding user response:
Well you can achieve this using the following query. So basically, you can perform a INNER JOIN
and filter out dates where the duration crosses 3 days within the ON
clause itself.
Now coming to limiting to the last clicked campaign, it can be achieved using ROW_NUMBER
function and setting the order of sequence to clicked_time DESC
. This way the last clicked date before purchase will have a sequence no. of 1. You can then just filter out the records where row_number is greater than 1 by wrapping the result set in a outer query.
-- Outer query to select just the last click for a any given purchase
SELECT * FROM (
SELECT p.date, p.purchase_time, c.click_time, c.campaign_id, p.revenue,
-- sequential row number for clicks sorted in descending order of date
ROW_NUMBER() OVER(PARTITION BY c.user_id ORDER BY c.click_time DESC) AS row_num
FROM purchases p
INNER JOIN campaign_clicks c
ON (
c.user_id = p.user_id
--- only select clicks that occured before the purchase
AND c.click_time<p.purchase_time
--- only select the clicks that occurred 3 days prior (mins * hours * days )
AND TIMESTAMPDIFF(MINUTE, c.click_time, p.purchase_time) <= (60*24*3)
)
) res WHERE res.row_num=1
You can also check the results on the DB-Fiddle link
CodePudding user response:
Snowflake supports joining laterally. That is, on to a function or correlated sub-query. This allows you to join on to a query that returns just one row (per input row).
SELECT
purchases.date
,purchases.revenue
,clicks.campaign_id
FROM
purchases
LEFT JOIN LATERAL
(
SELECT
campaign_id
,user_id
,click_time
FROM
campaign_clicks
WHERE
user_id = purchases.user_id
-- only select campaign clicks that occurred before the purchase
AND click_time < purchases.purchase_time
-- only include clicks that occurred within 3 days of the purchase
AND click_time >= DATEADD(days, -3, purchases.purchase_time)
ORDER BY
click_time DESC
LIMIT
1
)
AS clicks
(Please forgive typos, I'm on my phone.)