I have a simple table storing words for different ids.
CREATE TABLE words (
id INTEGER,
word TEXT,
);
CREATE INDEX ON words USING hash (id);
CREATE INDEX ON words USING hash (word);
Now I simply want to count the number of times a given word appears. My actual query is a bit different and involves other filters.
SELECT COUNT(1) FROM "words"
WHERE word = 'car'
My table has a billion of rows but the answer for this specific query is about 45k.
I hoped that the index on the words would make the query super fast but it still takes a 1 min 20 to be executed, which looks dereasonable. As a comparison, SELECT COUNT(1) FROM "words"
takes 1 min 57.
Here is the output of EXPLAIN:
Aggregate (cost=48667.00..48667.01 rows=1 width=8)
-> Bitmap Heap Scan on words (cost=398.12..48634.05 rows=13177 width=0)
Recheck Cond: (word = 'car'::text)
-> Bitmap Index Scan on words_word_idx (cost=0.00..394.83 rows=13177 width=0)
Index Cond: (word = 'car'::text)
I don't understand why there is a need to recheck the condition and why this query is not efficient.
CodePudding user response:
Hash indexes don't store the indexed value in the index, just its 32-bit hash and the ctid (pointer to the table row). That means they can't resolve hash collisions on their own, so it has to go to the table to obtain the value and then recheck it. This can involve a lot or extra IO compared to a btree index, which do store the value and can support index only scans.
CodePudding user response:
You could use B-Tree index whenever an indexed column is involved in a comparison using one of these operators:
< <= = >= >
I assume you are using =
for counting how many words are there. So, B-Tree index satisfy your requirements.
Reference: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE