How to make this group by faster-CodePudding

I have a big table with over 3B rows and my query now takes 3 days to execute.

This is my query:

insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp
group by word

Basically, my query group by word from allwordstemp table, and I want to know the filename when frequency = 1 that's why I added max(filename) because filename needs to be contained in an aggregate function such as max. I also don't need the value of filename if frequency > 1. Both tables have 2 indexes on word and filename.

allwordstemp table (filename is an id file):

CREATE TABLE `allwordstemp` (
  `word` varchar(45) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_allwordstemp_word` (`word`),
  KEY `idx_allwordstemp_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

frequencywords table:

CREATE TABLE `frequencywords` (
  `word` varchar(45) DEFAULT NULL,
  `frequency` int(11) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_frequencywords_word` (`word`),
  KEY `idx_frequencywords_frequency` (`frequency`),
  KEY `idx_frequencywords_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

EXPLAIN SELECT:

 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- ------------- 
| id | select_type | table        | partitions | type  | possible_keys | key                   | key_len | ref  | rows       | filtered | Extra       |
 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- ------------- 
|  1 | SIMPLE      | allwordstemp | NULL       | index | NULL          | idx_allwordstemp_word | 138     | NULL | 3487864881 |   100.00 | Using index |
 ---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- -------------

How can I make my query faster?

CodePudding user response：

Your filter criteria appears to be over frequency, and not word or filename. SO the first thing I'd do is index allwordstemp by frequency.

Then, assuming frequency is a whole number, I'd add a WHERE clause like so:

select word, max(filename)
from allwordstemp
where frequency = 1
group by word

The above will give you a list of all words that appear EXACTLY once in various file names.

I hope this helps, and all the best!

CodePudding user response：

Plan A

That is strange. It is using an index that it should avoid. Try the "index hint" IGNORE INDEX(idx_allwordstemp_word)

How often do you run this query? Does the table always have billions of rows? Do the two columns need to be NULL?

Plan B:

If they could be NOT NULL, we could add a id BIGINT AUTO_INCREMENT and change to

PRIMARY KEY(word, filename, id),
INDEX(id),
INDEX(filename)

Making such a change now might take days; I am suggesting this for next time.

If there is no need for indexing filename, then leave that out. (It is not useful in the current Select.)

Plan C

Replace KEY idx_allwordstemp_word (word) with

INDEX(word, filename)

(Again, this may take a long time.)

Bottom line

If this is one-time task, use Plan A. If the columns need to be NULLable, use Plan C. Else, Plan B is probably the best.