Given a data that looks as follows where the date is in string format YYYYMMDD
:
item | vietnamese | cost | unique_id | sales_date |
---|---|---|---|---|
fruits | trai cay | 10 | abc123 | 20211001 |
fruits | trai cay | 8 | foo99 | 20211001 |
fruits | trai cay | 9 | foo99 | 20211001 |
vege | rau | 3 | rr1239 | 20211001 |
vege | rau | 3 | rr1239 | 20211001 |
fruits | trai cay | 12 | abc123 | 20211002 |
fruits | trai cay | 14 | abc123 | 20211002 |
fruits | trai cay | 8 | abc123 | 20211002 |
fruits | trai cay | 5 | foo99 | 20211002 |
vege | rau | 8 | rr1239 | 20211002 |
vege | rau | 1 | rr1239 | 20211002 |
vege | rau | 12 | ud9213 | 20211002 |
vege | rau | 19 | r11759 | 20211002 |
fruits | trai cay | 6 | foo99 | 20211003 |
fruits | trai cay | 2 | abc123 | 20211003 |
fruits | trai cay | 12 | abc123 | 20211003 |
vege | rau | 1 | ud97863 | 20211003 |
vege | rau | 9 | r112359 | 20211003 |
fruits | trai cay | 6 | foo99 | 20211004 |
fruits | trai cay | 2 | abc123 | 20211004 |
fruits | trai cay | 12 | abc123 | 20211004 |
vege | rau | 9 | r112359 | 20211004 |
The goal is sample all the rows within a certain time frame, e.g. 2020-10-02 to 2020-10-03 and to extract a maximum of 3 rows per day, e.g. with this query:
SELECT * FROM mytable
WHERE sales_date BETWEEN '20211002' AND '20211003'
ORDER BY RAND () LIMIT 6
the expected output for the table above is:
item | vietnamese | cost | unique_id | sales_date |
---|---|---|---|---|
fruits | trai cay | 8 | abc123 | 20211002 |
fruits | trai cay | 5 | foo99 | 20211002 |
vege | rau | 8 | rr1239 | 20211002 |
fruits | trai cay | 12 | abc123 | 20211003 |
vege | rau | 1 | ud97863 | 20211003 |
vege | rau | 9 | r112359 | 20211003 |
But there is a possibility that all 6 rows expected comes from a single day:
item | vietnamese | cost | unique_id | sales_date |
---|---|---|---|---|
fruits | trai cay | 12 | abc123 | 20211002 |
fruits | trai cay | 14 | abc123 | 20211002 |
fruits | trai cay | 8 | abc123 | 20211002 |
fruits | trai cay | 5 | foo99 | 20211002 |
vege | rau | 8 | rr1239 | 20211002 |
vege | rau | 1 | rr1239 | 20211002 |
So to ensure that I have max 3 rows a day, I'm running multiple queries per day, i.e.
SELECT * FROM mytable
WHERE sales_date='20211002'
ORDER BY RAND () LIMIT 3
and
SELECT * FROM mytable
WHERE sales_date='20211003'
ORDER BY RAND () LIMIT 3
Is there a way to ensure N no. of max limit rows per day in a single query?
Otherwise is there a way to combine those one query per day into a "super-query"? If we're talking about a full year, it'll 365 queries, one per day.
CodePudding user response:
Since 6 rows over 2 days means exactly 3 rows per day, let's expand it to a week.
In a subquery use row_number
to assign a number to each row for each date. Then only select those with a row number of 3 or less.
select *
from (
select
*,
row_number() over (partition by sales_date order by rand()) as row
from mytable
where sales_date between '20211002' and '20211009'
)
where row <= 3
order by rand()
limit 6