Home > Blockchain >  How to efficiently detect data corruption in a SQLite database?
How to efficiently detect data corruption in a SQLite database?

Time:09-18

Supposing there is no power loss or catastrophic OS crash so fsync() calls succeed and DB changes are truly written to oxide.

Since PRAGMA integrity_check does NOT [1] actually detect data corruption in a SQLite database, the developer needs to come up with a way to detect data corruption themselves.

Storing a checksum of the entire database file works but is too inefficient to do for every row insert (suppose the database is multiple gigabytes in size).

Keeping a checksum for each row is the obvious solution but this does not detect the case where entire rows are deleted by the data corruption. So we need a way to detect whether rows have been deleted from the table.

The method I thought up to detect missing rows is to keep a XOR sum of all checksums. This is efficient because every time a row is updated, we simply XOR the sum with the row's old checksum, and then XOR the sum with the row's new checksum. It occurs to me that this is not the strongest method but I have not been able to find any stronger alternatives that are efficient so suggestions are welcome.

EDIT: I have thought of an alternative method which requires all tables to be append-only. In an append-only table we can assume that the rowids are consecutive which means we can easily notice any missing rows except for the last row, which we can store somewhere.

References:

  1. https://www.sqlite.org/pragma.html

CodePudding user response:

Here is a scheme that should work:

In the table, add a "prev" and a "next" column which hold the primary keys of the previous and next rows respectively, so you can think of the table as a doubly linked list. This means that when a row is deleted from the table via data corruption, a scan through the table will find that a "next" key does not match the key of the next row, or the "prev" key does not match the key of the previous row. These columns should also be made UNIQUE so that two rows can't have the same "next" or "prev" row also they should have the FOREIGN KEY constraint.

See below for incorrect answer:

Pragma integrity_check will in fact detect missing rows, see where it says:

Missing or surplus index entries

So if you lose a row then it will say something like:

row 12345 missing from index some_index

EDIT: This answer is actually incorrect. The row x missing message only indicates that a row does not have a corresponding entry in the index, it does NOT mean that an entry in the index does not have a corresponding row (which is what would happen if a row was deleted).

  • Related