SQL (analytical) query to retrieve first and second rows for each value of a group by-CodePudding

I have a situation that I would like to resolve in SQL using as few inner joins as possible, so I am thinking of using window functions.

I am using Snowflake, but I am happy to see answers in other DB engines as well.

Table sales has sale_id and seller_id, as well as sale_datetime, amount... Table seller has seller_id

I would like to get the first and second sales for each seller, as columns preferably (so that I can add and subtract eg sum of amounts). If there is only one sale, ideally I would not retrieve it.

I am trying to use something like this:

select se.seller_id,sa.*, first_value(sale_datetime) over (ORDER BY
sale_datetime ),  LAG(sale_datetime) OVER (ORDER BY sale_datetime)
from seller se inner join sales sa  on sa.seller_id = se.seller_id
order by se.seller_id;

But I would need to add a group by, because the window functions need to be applied at the seller level, not to the total of the table. Whether I add it at the end of the query or inside the window, I get an error.

The other way that I have of doing this without windows is with two with...as clauses that retrieve first and second sale for each seller respectively and a join between them, but I am trying to avoid this for performance reasons.

Thanks!

CodePudding user response：

You answer was already mostly there. You just need to use the LAG and use ROW_NUMBER/QUALIFY

CTE for Data:

with sales(sale_id, seller_id, sale_datetime, amount) as (
    select * from values
    (4, 10, '2022-04-24'::date, 100),
    (3, 10, '2022-04-23'::date, 200),
    (2, 10, '2022-04-22'::date, 400),
    (1, 11, '2022-04-22'::date, 300)
)

working SQL:

select sa.*,
    LAG(sale_datetime) OVER (partition by sa.seller_id ORDER BY sa.sale_datetime) as prior_sale_date,
    LAG(amount) OVER (partition by sa.seller_id ORDER BY sa.sale_datetime) as prior_amount  
FROM sales AS sa  
qualify row_number() over(partition by sa.seller_id order by sale_datetime) = 2
order by sa.seller_id;

gives:

SALE_ID	SELLER_ID	SALE_DATETIME	AMOUNT	PRIOR_SALE_DATE	PRIOR_AMOUNT
2	10	2022-04-23	200	2022-04-22	400

this works because every row get the prior values via the two LAG's, but we only keep the second row.

I Also dropped the seller se table because it is add no value to the process, as it's just a filter.., but that can be present via:

more complex data CTE's:

    select * from values
    (4, 10, '2022-04-24'::date, 100),
    (3, 10, '2022-04-23'::date, 200),
    (2, 10, '2022-04-22'::date, 400),
    (1, 11, '2022-04-22'::date, 300)
), seller(seller_id) as (
    select * from values (10),(11),(12),(13)
)

working SQL with JOIN back in.

select sa.*,
    LAG(sale_datetime) OVER (partition by sa.seller_id ORDER BY sa.sale_datetime) as prior_sale_date,
    LAG(amount) OVER (partition by sa.seller_id ORDER BY sa.sale_datetime) as prior_amount  
FROM seller as se
JOIN sales AS sa
    ON se.seller_id = sa.seller_id
qualify row_number() over(partition by sa.seller_id order by sale_datetime) = 2
order by sa.seller_id;

gives:

SALE_ID	SELLER_ID	SALE_DATETIME	AMOUNT	PRIOR_SALE_DATE	PRIOR_AMOUNT
3	10	2022-04-23	200	2022-04-22	400

CodePudding user response：

ROW_NUMBER and QUALIFY could be used:

select se.seller_id,sa.*
from seller se 
join sales sa  
  on sa.seller_id = se.seller_id
qualify row_number() over(partition by sa.seller_id order by sale_datetime) <= 2
order by se.seller_id;