Most efficient way in mysql for taking average after deduplication-CodePudding

Suppose I have a table A with columns: (id, item, price). It has records like (banana, 1), (banana, 1), (apple, 2). I want to calculate the average price after deduplication based on the item column. In the example that I just gave, the average should be (1 2)/ 2 = 1.5.

There is a simple-minded way of doing this:

Select avg(price) from (select min(price) from A group by item).

However in reality I have a huge table so it is not realistic to do the select subquery first. I wonder whether there is any hack around this.

CodePudding user response：

Outside of deduplicating this per some of the comments, one approach is to add an index on both item and price. With one million rows, time to execute is reduced from ~0.85 s to 0 s.

SELECT AVG(price)
FROM (
    SELECT MIN(price) price
    FROM A
    GROUP BY item
) t;

--  1 row(s) returned   0.875 sec / 0.000 sec

ALTER TABLE A ADD INDEX itemPriceIndex (item, price)
--  0 row(s) affected Records: 0  Duplicates: 0  Warnings: 0    1.891 sec

SELECT AVG(price)
 FROM (
  SELECT MIN(price) price
     FROM A
     GROUP BY item
 ) t
-- 1 row(s) returned    0.000 sec / 0.000 sec

Data generation code (only 100,000 rows generated below but timings above were 1,000,000 rows):

DROP DATABASE IF EXISTS stackOverflow;
CREATE DATABASE stackOverflow;
USE stackOverflow;

CREATE TABLE A (
   id INT NOT NULL AUTO_INCREMENT,
   item VARCHAR(20),
   price INT,
   PRIMARY KEY (id)
);

DROP PROCEDURE if EXISTS generateData;
delimiter #
CREATE PROCEDURE generateData(in nReps int)
BEGIN
    DECLARE v_counter int unsigned DEFAULT 0;
    
    TRUNCATE TABLE A;
    START transaction;
    while v_counter < nReps DO
        INSERT INTO A (item, price) VALUES ("banana", 1);
        SET v_counter = v_counter   1;
    END WHILE;
    INSERT INTO A (item, price) VALUES("apple", 2);
    COMMIT;
END #
delimiter ;
CALL generateData(100000);

CodePudding user response：

You can use AVG() window function after deduplication:

SELECT DISTINCT AVG(MAX(price)) OVER () AS avg_price
FROM tablename
GROUP BY item;

See the demo.