Cassandra Table Compaction for timeseries-CodePudding

My program generates large amount time-series data into the following table:

CREATE TABLE AccountData
(
    PartitionKey text,
    RowKey text,
    AccountId uuid,
    UnitId uuid,
    ContractId uuid,
    Id uuid,
    LocationId uuid,
    ValuesJson text,
    PRIMARY KEY (PartitionKey, RowKey)
)
WITH CLUSTERING ORDER BY (RowKey ASC)

The PartitionKey is a dictionary value (one of 10) and the RowKey is DateTime converted to long.

Now due to the crazy amount of data that is being generated by the program, every ContractId has a different retention policy in the code. The code goes and deletes old data based on the retention for the specific ContractId.

I am now running into problems where during a SELECT statement it picks up too many Tombstones and I get an error.

What Table Compaction strategy should I use to solve this Tombstone problem?

CodePudding user response：

PartitionKey is a dictionary value (one of 10)

I think this is likely your problem. Basically, all of the data in the cluster is ending up on 10 partitions. Those are going to get extremely large as time progresses. In general, you want to keep your partitions between 1MB-10MB in size. The lower the better.

I would recommend splitting the partition up. If it's time related, take a time unit which makes the most sense to your query pattern. For example, if most of the queries are month-based, perhaps something like this might work:

PRIMARY KEY ((month,PartitionKey),RowKey)

That will create a partition for each combination of month and the current PartitionKey.

Likewise, most time series use cases tend to query most-recent data more often. To that end, it usually makes sense to sort data in the partitions by time, in descending order. That is of course, if RowKey is indeed a data/time value.

WITH CLUSTERING ORDER BY (RowKey DESC)

Also, a nice little side-effect of this model, is that any old data which is tombstoned is now at the "bottom" of the partition. So, depending on the delete patterns, tombstones will still exist. But if the data is clustered in descending order...the tombstones are never/rarely queried.

What Table Compaction strategy should I use to solve this Tombstone problem?

So I do not believe that simply changing the compaction strategy will be the silver bullet to solve this problem. That being said, I suggest looking into the TimeWindowCompactionStrategy. That compaction strategy stores its SSTable files by a designated time period (window). This prevents files full of old, obsoleted, or tombstoned data from being queried.