I have a big table with over 3B rows and my query now takes 3 days to execute.
This is my query:
insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp
group by word
Basically, my query group by word from allwordstemp table, and I want to know the filename when frequency = 1 that's why I added max(filename) because filename needs to be contained in an aggregate function such as max. I also don't need the value of filename if frequency > 1. Both tables have 2 indexes on word and filename.
allwordstemp table (filename is an id file):
CREATE TABLE `allwordstemp` (
`word` varchar(45) DEFAULT NULL,
`filename` int(11) DEFAULT NULL,
KEY `idx_allwordstemp_word` (`word`),
KEY `idx_allwordstemp_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
frequencywords table:
CREATE TABLE `frequencywords` (
`word` varchar(45) DEFAULT NULL,
`frequency` int(11) DEFAULT NULL,
`filename` int(11) DEFAULT NULL,
KEY `idx_frequencywords_word` (`word`),
KEY `idx_frequencywords_frequency` (`frequency`),
KEY `idx_frequencywords_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
EXPLAIN SELECT:
---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- -------------
| 1 | SIMPLE | allwordstemp | NULL | index | NULL | idx_allwordstemp_word | 138 | NULL | 3487864881 | 100.00 | Using index |
---- ------------- -------------- ------------ ------- --------------- ----------------------- --------- ------ ------------ ---------- -------------
How can I make my query faster?
CodePudding user response:
Your filter criteria appears to be over frequency, and not word or filename. SO the first thing I'd do is index allwordstemp by frequency.
Then, assuming frequency is a whole number, I'd add a WHERE clause like so:
select word, max(filename)
from allwordstemp
where frequency = 1
group by word
The above will give you a list of all words that appear EXACTLY once in various file names.
I hope this helps, and all the best!
CodePudding user response:
Plan A
That is strange. It is using an index that it should avoid. Try the "index hint" IGNORE INDEX(idx_allwordstemp_word)
How often do you run this query? Does the table always have billions of rows? Do the two columns need to be NULL
?
Plan B:
If they could be NOT NULL
, we could add a id BIGINT AUTO_INCREMENT
and change to
PRIMARY KEY(word, filename, id),
INDEX(id),
INDEX(filename)
Making such a change now might take days; I am suggesting this for next time.
If there is no need for indexing filename
, then leave that out. (It is not useful in the current Select.)
Plan C
Replace KEY idx_allwordstemp_word (word)
with
INDEX(word, filename)
(Again, this may take a long time.)
Bottom line
If this is one-time task, use Plan A. If the columns need to be NULLable, use Plan C. Else, Plan B is probably the best.