Home > Blockchain >  How to record repetitive data by date
How to record repetitive data by date

Time:03-11

Im trying to do report system by range date (from/to date) for products(around 30k products that will repetitive record in DB) viewed/clicked by users. Each time user click on product i record product_id and date on single row in Database. My problem is coming when i have to select and display report because table quickly grow up to 4 millions in 2 months and i have to keep record up to 6 months.

My question is there any better way to optimize the query or the way i record them ?

DB Table

CREATE TABLE `product_view` (
`id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

ALTER TABLE `product_view`
 ADD PRIMARY KEY (`id`),
 ADD KEY `product_id` (`product_id`) USING BTREE;

My select query without range date take around 50 seconds to pull results

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
GROUP BY pv.product_id
ORDER BY total
DESC LIMIT 0,20

Query EXAMPlE

id  select_type     table   type      possible_keys     key           key_len      ref                      rows        Extra
1   SIMPLE           pv     range   product_id           product_id     4         NULL                      1647717     Using where; Using index; Using temporary; Using filesort
1   SIMPLE           p     eq_ref   PRIMARY              PRIMARY        4         test.pv.product_id        1
1   SIMPLE           pd     ref     PRIMARY,product_id   PRIMARY        4         test.pv.product_id        1

Query with range date

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
    AND DATE(pv.date_create) >= '2021-07-25'
    AND DATE(pv.date_create) <= '2022-03-10'
GROUP BY pv.product_id
ORDER BY total DESC LIMIT 0,20

CodePudding user response:

Try rewriting the query as a correlated query:

select p.product_id, p.model, pd.name, (
    select count(*)
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10'   interval 1 day
) as total
from product as p
left join product_description as pd on p.product_id = pd.product_id
where exists (
    select 1
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10'   interval 1 day
    -- this is a far more optimized version for dates used in your op
)
order by total desc
limit 0, 20

This does not involve grouping so it should be faster than your original query. If date filter is not required then remove the where exists part and and pv.date_create ... from the count sub-query.

Secondly, I don't see any useful indexes in the explain. You should try the following indexes:

create index ix1 on product_view (product_id, date_create)
-- should be (i) good for joining (ii) "covers" the date column

CodePudding user response:

CREATE TABLE `product_view` (
-- toss, as useless:  `id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL,
PRIMARY KEY(product_id, date_create)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then PARTITION BY RANGE(TO_DAYS(date_create)) to make the deletion much more efficient. I recommend about 30 weekly partitions. See Partition .

COUNT(pv.id) -- The usual pattern is simply COUNT(*). With an id, it checks that id for being NOT NULL, which is unnecessary.

pv.product_id > 0 -- is there something special about ids <=0?

Let's rearrange the query to "start" with the count:

SELECT pd.name, pv.product_id, p.model, s.total
    FROM ( SELECT pv.product_id, COUNT(*) AS total
             FROM product_view AS pv
             WHERE pv.date_create >= '2021-07-25'
         ) AS s
    JOIN product AS p  ON p.product_id = pv.product_id
    ORDER BY total DESC
    LIMIT 0, 20

Note:

  • Get rid of DATE(), it is not "sargable" and prevents the use of any index.
  • If you want the count to go through yesterday, then add AND pv.date_create < CURDATE()
  • LEFT implies that the 'right' row may be missing; I suspect that is not the case.
  • I got rid of pd since it is not used (and costs a lot to have if it won't be used).
  • SQL_NO_CACHE is going away in 8.0; you may as well turn off the Query cache now.
  • If the date range goes back through most of the data, the entire table will be scanned, so I have not helped performance much in that area. So...
  • Build and maintain a Summary Table with (dy, product_id, subtotal); then do the query against it. See Summary Tables
  • Related