Joining table to itself with multiple join criteria logic-CodePudding

I'm trying to understand the logic behind the syntax below. Based on the following question, table and syntax:

Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.

Column Data Type: id: int | user_id: int | item: varchar |created_at: datetime | revenue: int

SELECT DISTINCT(a1.user_id)
FROM amazon_transactions a1
JOIN amazon_transactions a2 ON a1.user_id=a2.user_id
    AND a1.id <> a2.id
    AND a2.created_at::date-a1.created_at::date BETWEEN 0 AND 7
ORDER BY a1.user_id

Why does the table need to be joined to itself in this case?
How does 'AND a1.id <> a2.id' portion of syntax contribute to the join?

CodePudding user response：

You are looking for users that have 2 records on that table whose date distance is lower (or equal) than 7 days To accomplish this, you treat the table as if it were 2 different (but equal tables) because you have to match a row on the first table with a row on the second table

Of course you don't want to match a row with itself, so

AND a1.id <> a2.id

accomplishes that

CodePudding user response：

The table needs to be joined with itself because, you just have one table, and you want to find out returning users (by comparing the duration between transaction dates for the same user).
AND a1.id <> a2.id portion of the syntax removes the same transactions, i.e. prevents the transactions with the same id to be included in the joined table.

CodePudding user response：

There are two scenarios I can think of based on the id column values. Are id column values generated based on timely sequence ? If so, to answer your first question ,we can but don't have to use join syntax. Here is how to achieve your goal using a correlated subquery , with sample data created.

create table amazon_transactions(id int , user_id int , item varchar(20),created_at datetime , revenue int);
insert amazon_transactions (id,user_id,created_at) values
(1,1,'2020-01-05 15:33:22'),
(2,2,'2020-01-05 16:33:22'),
(3,1,'2020-01-08 18:33:22'),
(4,1,'2020-01-22 17:33:22'),
(5,2,'2020-02-05 15:33:22'),
(6,2,'2020-03-05 15:33:22');

select * from amazon_transactions;
-- sample set:
| id   | user_id | item | created_at          | revenue |
 ------ --------- ------ --------------------- --------- 
|    1 |       1 | NULL | 2020-01-05 15:33:22 |    NULL |
|    2 |       2 | NULL | 2020-01-05 16:33:22 |    NULL |
|    3 |       1 | NULL | 2020-01-08 18:33:22 |    NULL |
|    4 |       1 | NULL | 2020-01-22 17:33:22 |    NULL |
|    5 |       2 | NULL | 2020-02-05 15:33:22 |    NULL |
|    6 |       2 | NULL | 2020-03-05 15:33:22 |    NULL |

-- Here is the answer using a correlated subquery:

select distinct user_id
from amazon_transactions t
where datediff(
                (select created_at from amazon_transactions where user_id=t.user_id and id-t.id>=1 limit 1 ),
                created_at
            )<=7
;
-- result:
| user_id |
 --------- 
|       1 |

However,what if the id values are NOT transaction time based? Then the id values are not at all helpful in our requirement. In this case, a JOIN is more capable than a correlated subquery and we need to arrange the order based on transaction time for each user in order to make the necessary join condition. And to answer your second question, the AND a1.id <> a2.id portion of syntax contribute by excluding two of the same transaction making a pair. However, to my understanding the matching scope is too high to be effective. We only care if CONSECUTIVE transactions have a within-7-day gap, but the AND a1.id <> a2.id overdoes the job. For instance, we want to check the gap between transaction1 and transaction2,transaction2 and transaction3, NOT transaction1 and transaction3
Note: by using the user variable row_id trick, we can produce the row id which is used to match consecutive transactions for each user, thus eliminating the wasteful job of random transaction check.

 
select distinct t1.user_id
from
    (select user_id,created_at,@row_id:=@row_id 1 as row_id 
    from amazon_transactions ,(select @row_id:=0) t
    order by user_id,created_at)t1
join
    (select user_id,created_at,@row_num:=@row_num 1 as row_num 
    from amazon_transactions ,(select @row_num:=0) t
    order by user_id,created_at)t2
on t1.user_id=t2.user_id and t2.row_num-t1.row_id=1 and datediff(t2.created_at,t1.created_at)<=7
;
-- result
| user_id |
 --------- 
|       1 |