Home > other >  How to make this group by faster
How to make this group by faster

Time:04-03

I have a big table with over 3B rows and my query now takes 3 days to execute.

This is my query:

insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp
group by word

Basically, my query group by word from allwordstemp table, and I want to know the filename when frequency = 1 that's why I added max(filename) because filename needs to be contained in an aggregate function such as max. I also don't need the value of filename if frequency > 1. Both tables have 2 indexes on word and filename.

allwordstemp table (filename is an id file):

CREATE TABLE `allwordstemp` (
  `word` varchar(45) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_allwordstemp_word` (`word`),
  KEY `idx_allwordstemp_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

frequencywords table:

CREATE TABLE `frequencywords` (
  `word` varchar(45) DEFAULT NULL,
  `frequency` int(11) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_frequencywords_word` (`word`),
  KEY `idx_frequencywords_frequency` (`frequency`),
  KEY `idx_frequencywords_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

EXPLAIN SELECT:

 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- ------------- 
| id | select_type | table        | partitions | type  | possible_keys | key                   | key_len | ref  | rows       | filtered | Extra       |
 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- ------------- 
|  1 | SIMPLE      | allwordstemp | NULL       | index | NULL          | idx_allwordstemp_word | 138     | NULL | 3487864881 |   100.00 | Using index |
 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- ------------- 

How can I make my query faster?

CodePudding user response:

Your filter criteria appears to be over frequency, and not word or filename. SO the first thing I'd do is index allwordstemp by frequency.

Then, assuming frequency is a whole number, I'd add a WHERE clause like so:

select word, max(filename)
from allwordstemp
where frequency = 1
group by word

The above will give you a list of all words that appear EXACTLY once in various file names.

I hope this helps, and all the best!

CodePudding user response:

Plan A

That is strange. It is using an index that it should avoid. Try the "index hint" IGNORE INDEX(idx_allwordstemp_word)

How often do you run this query? Does the table always have billions of rows? Do the two columns need to be NULL?

Plan B:

If they could be NOT NULL, we could add a id BIGINT AUTO_INCREMENT and change to

PRIMARY KEY(word, filename, id),
INDEX(id),
INDEX(filename)

Making such a change now might take days; I am suggesting this for next time.

If there is no need for indexing filename, then leave that out. (It is not useful in the current Select.)

Plan C

Replace KEY idx_allwordstemp_word (word) with

INDEX(word, filename)

(Again, this may take a long time.)

Bottom line

If this is one-time task, use Plan A. If the columns need to be NULLable, use Plan C. Else, Plan B is probably the best.

  • Related