Suppose I have a table A with columns: (id, item, price). It has records like (banana, 1), (banana, 1), (apple, 2). I want to calculate the average price after deduplication based on the item column. In the example that I just gave, the average should be (1 2)/ 2 = 1.5.
There is a simple-minded way of doing this:
Select avg(price) from (select min(price) from A group by item).
However in reality I have a huge table so it is not realistic to do the select subquery first. I wonder whether there is any hack around this.
CodePudding user response:
Outside of deduplicating this per some of the comments, one approach is to add an index on both item
and price
. With one million rows, time to execute is reduced from ~0.85 s to 0 s.
SELECT AVG(price)
FROM (
SELECT MIN(price) price
FROM A
GROUP BY item
) t;
-- 1 row(s) returned 0.875 sec / 0.000 sec
ALTER TABLE A ADD INDEX itemPriceIndex (item, price)
-- 0 row(s) affected Records: 0 Duplicates: 0 Warnings: 0 1.891 sec
SELECT AVG(price)
FROM (
SELECT MIN(price) price
FROM A
GROUP BY item
) t
-- 1 row(s) returned 0.000 sec / 0.000 sec
Data generation code (only 100,000 rows generated below but timings above were 1,000,000 rows):
DROP DATABASE IF EXISTS stackOverflow;
CREATE DATABASE stackOverflow;
USE stackOverflow;
CREATE TABLE A (
id INT NOT NULL AUTO_INCREMENT,
item VARCHAR(20),
price INT,
PRIMARY KEY (id)
);
DROP PROCEDURE if EXISTS generateData;
delimiter #
CREATE PROCEDURE generateData(in nReps int)
BEGIN
DECLARE v_counter int unsigned DEFAULT 0;
TRUNCATE TABLE A;
START transaction;
while v_counter < nReps DO
INSERT INTO A (item, price) VALUES ("banana", 1);
SET v_counter = v_counter 1;
END WHILE;
INSERT INTO A (item, price) VALUES("apple", 2);
COMMIT;
END #
delimiter ;
CALL generateData(100000);
CodePudding user response:
You can use AVG()
window function after deduplication:
SELECT DISTINCT AVG(MAX(price)) OVER () AS avg_price
FROM tablename
GROUP BY item;
See the demo.