Im trying to do report system by range date (from/to date) for products(around 30k products that will repetitive record in DB) viewed/clicked by users. Each time user click on product i record product_id and date on single row in Database. My problem is coming when i have to select and display report because table quickly grow up to 4 millions in 2 months and i have to keep record up to 6 months.
My question is there any better way to optimize the query or the way i record them ?
DB Table
CREATE TABLE `product_view` (
`id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `product_view`
ADD PRIMARY KEY (`id`),
ADD KEY `product_id` (`product_id`) USING BTREE;
My select query without range date take around 50 seconds to pull results
SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
LEFT JOIN product p ON p.product_id = pv.product_id
LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
GROUP BY pv.product_id
ORDER BY total
DESC LIMIT 0,20
Query EXAMPlE
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE pv range product_id product_id 4 NULL 1647717 Using where; Using index; Using temporary; Using filesort
1 SIMPLE p eq_ref PRIMARY PRIMARY 4 test.pv.product_id 1
1 SIMPLE pd ref PRIMARY,product_id PRIMARY 4 test.pv.product_id 1
Query with range date
SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
LEFT JOIN product p ON p.product_id = pv.product_id
LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
AND DATE(pv.date_create) >= '2021-07-25'
AND DATE(pv.date_create) <= '2022-03-10'
GROUP BY pv.product_id
ORDER BY total DESC LIMIT 0,20
CodePudding user response:
Try rewriting the query as a correlated query:
select p.product_id, p.model, pd.name, (
select count(*)
from product_view as pv
where pv.product_id = p.product_id
and pv.date_create >= '2021-07-25'
and pv.date_create < '2022-03-10' interval 1 day
) as total
from product as p
left join product_description as pd on p.product_id = pd.product_id
where exists (
select 1
from product_view as pv
where pv.product_id = p.product_id
and pv.date_create >= '2021-07-25'
and pv.date_create < '2022-03-10' interval 1 day
-- this is a far more optimized version for dates used in your op
)
order by total desc
limit 0, 20
This does not involve grouping so it should be faster than your original query. If date filter is not required then remove the where exists
part and and pv.date_create ...
from the count sub-query.
Secondly, I don't see any useful indexes in the explain. You should try the following indexes:
create index ix1 on product_view (product_id, date_create)
-- should be (i) good for joining (ii) "covers" the date column
CodePudding user response:
CREATE TABLE `product_view` (
-- toss, as useless: `id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL,
PRIMARY KEY(product_id, date_create)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then PARTITION BY RANGE(TO_DAYS(date_create))
to make the deletion much more efficient. I recommend about 30 weekly partitions. See Partition .
COUNT(pv.id)
-- The usual pattern is simply COUNT(*)
. With an id, it checks that id for being NOT NULL
, which is unnecessary.
pv.product_id > 0
-- is there something special about ids <=0?
Let's rearrange the query to "start" with the count:
SELECT pd.name, pv.product_id, p.model, s.total
FROM ( SELECT pv.product_id, COUNT(*) AS total
FROM product_view AS pv
WHERE pv.date_create >= '2021-07-25'
) AS s
JOIN product AS p ON p.product_id = pv.product_id
ORDER BY total DESC
LIMIT 0, 20
Note:
- Get rid of
DATE()
, it is not "sargable" and prevents the use of any index. - If you want the count to go through yesterday, then add
AND pv.date_create < CURDATE()
LEFT
implies that the 'right' row may be missing; I suspect that is not the case.- I got rid of
pd
since it is not used (and costs a lot to have if it won't be used). SQL_NO_CACHE
is going away in 8.0; you may as well turn off the Query cache now.- If the date range goes back through most of the data, the entire table will be scanned, so I have not helped performance much in that area. So...
- Build and maintain a Summary Table with (dy, product_id, subtotal); then do the query against it. See Summary Tables