Suppose I need to run an automated housekeeping task with this query:
delete from sample_table where id = '1'
This scheduled query gets executed from multiple service instances. Will this have a significant performance impact? What would be an appropriate way to test this?
CodePudding user response:
Issuing multiple deletes for the same partition can have a significant impact on your cluster.
Remember that all writes in Cassandra (INSERT, UPDATE, DELETE) are inserts under the hood. Since Cassandra does not perform a read-before-write (with the exception of lightweight transactions), issuing a DELETE will insert a tombstone marker regardless of whether the data exists or has already been deleted.
Every single DELETE you issue counts as a write request, so depending on how busy your cluster is, it may have a measurable impact on its performance. Cheers!
CodePudding user response:
Erick's answer is pretty solid, but I'd just like to add that where you'll most likely see performance issues is at read time. That's because doing a:
SELECT * FROM sample_table WHERE id='1';
...will read ALL of the tombstones written by those DELETE statements from the SSTable files. With the default table settings, deleted data stays around for 10 days (gc_grace_seconds = 864000, to ensure proper replication) before it can be picked up by compaction.
So figure out how many times that DELETE happens per key over a 10-day period, and that's about how many tombstones Cassandra will have to reconcile at read time.
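To put a number on that, here is a minimal Python sketch of the arithmetic described above. The function name, its parameters, and the example figures (5 instances, one run per hour) are hypothetical; only the 10-day default grace period comes from Cassandra. Treat the result as a rough upper bound, since Cassandra may merge tombstones for the same partition before a read ever has to reconcile them.

```python
# Cassandra's default gc_grace_seconds: 864000 seconds = 10 days.
GC_GRACE_SECONDS = 864_000
SECONDS_PER_DAY = 86_400


def estimate_tombstones(instances, runs_per_day_per_instance,
                        gc_grace_seconds=GC_GRACE_SECONDS):
    """Rough upper bound on DELETE markers written for one key
    during the grace period (hypothetical helper, not a Cassandra API)."""
    grace_days = gc_grace_seconds / SECONDS_PER_DAY
    return int(instances * runs_per_day_per_instance * grace_days)


if __name__ == "__main__":
    # Assumed workload: 5 service instances, each running the job hourly.
    print(estimate_tombstones(instances=5, runs_per_day_per_instance=24))
```

If the estimate comes out large, having only one instance run the housekeeping job (or running it less often) shrinks it proportionally.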