Home > Enterprise >  Mysql insert query performance on indexed column if the column value is null
Mysql insert query performance on indexed column if the column value is null

Time:05-27

I have a mysql table having ~50 million rows. I want to add a secondary index on a column that may have NULL values. I want to understand if inserting a row in this table that has value of this column as NULL will still be an expensive operation? Or is inserting a row adds an overhead only for the indexed columns having non null values in Mysql?

CodePudding user response:

Let's try an experiment. I created a test table:

mysql> create table mytable (id serial primary key, x int, y int);

I filled it with a few million rows. Then I tested a 1-million row INSERT of NULLs:

mysql> insert into mytable (x, y) select null, null from mytable limit 1000000;
Query OK, 1000000 rows affected (2.57 sec)

And the same with non-NULL values:

mysql> insert into mytable (x, y) select 1234, 1234 from mytable limit 1000000;
Query OK, 1000000 rows affected (2.60 sec)

Now add an index and try the test again:

mysql> alter table mytable add index (x);

mysql> insert into mytable (x, y) select null, null from mytable limit 1000000;
Query OK, 1000000 rows affected (3.12 sec)

mysql> insert into mytable (x, y) select 1234, 1234 from mytable limit 1000000;
Query OK, 1000000 rows affected (3.21 sec)

Now I add an index on the last column, so there are two index writes instead of just one, and try the test again:

mysql> alter table mytable add index (y);

mysql> insert into mytable (x, y) select null, null from mytable limit 1000000;
Query OK, 1000000 rows affected (3.64 sec)

mysql> insert into mytable (x, y) select 1234, 1234 from mytable limit 1000000;
Query OK, 1000000 rows affected (3.82 sec)

I know this test is flawed. I'm lazy and I'm not reinitializing the table to its initial size before each test. So the table is getting larger and larger, and that's probably accounting for the increase in time of each test.

The point is not to prove the answer one way or the other. It's to show that if you have a question like this, you have the opportunity and the responsibility to test it yourself. That's probably going to give better results than asking on Stack Overflow, for several reasons:

  • You don't have to wait for someone to answer, if anyone ever does.

  • You avoid spurious answers from people who don't actually know.

  • You avoid answers based on flawed methodology, like the one I showed above.

There's a reason your education was in Computer Science. You should embrace your role as a scientist, and think of what kind of experiment could give you the answer (with proper methodology).

CodePudding user response:

Think of NULL as just another value.

Think of an INDEX as a list of pairs -- the key value and some kind of pointer to the row. (The key value may be NULL.) Also, think of the INDEX being just like a table -- stored in a BTree. This is ordered by the key, just as the data is ordered by the PRIMARY KEY in its BTree.

Adding a row to the table adds a row to the data's BTree and to each secondary INDEX's BTree.

By thinking of NULL as just another value, you can reasonably guess that the various operations don't treat NULL as different.

So, if the business logic needs NULL, use it without worrying.

There are usage issues with NULL. WHERE x = NULL should probably "wrong" and should be WHERE x IS NULL. NULL is not equal anything, including another NULL. And other issues where NULL is not fully "just another value".

  • Related