Spark on Cassandra : is there a way to remove data by partition key?

Time:10-28

The Spark Cassandra Connector provides the RDD.deleteFromCassandra(keyspaceName, tableName) method.

The values in the RDD are interpreted as Primary Key constraints.

I have a table like this:

CREATE TABLE table (a int, b int, c int, PRIMARY KEY (a,b));

As you can see, a is the partition key and b is the clustering key.

I need a Spark app that deletes efficiently by partition key, not by the full primary key.

Indeed, my goal is always to drop entire partitions by their partition keys, rather than creating a tombstone for each primary key (row).
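For context, this is the distinction in plain CQL: a DELETE restricted to the partition key alone writes a single partition tombstone, while a DELETE with the full primary key writes one row tombstone per row. A sketch against the table above (the keyspace name ks is an assumption):

```
-- One partition tombstone: drops the entire partition a = 1
DELETE FROM ks.table WHERE a = 1;

-- One row tombstone: removes only the row (a = 1, b = 2)
DELETE FROM ks.table WHERE a = 1 AND b = 2;
```

The question is how to get the connector to issue the first form rather than the second.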

How can I do that with the Spark connector?

Thank you

CodePudding user response:

Yes, it's possible if you specify the keyColumns parameter of the .deleteFromCassandra function (docs). For example, if you have a composite partition key consisting of two columns, part1 and part2:

rdd.deleteFromCassandra("keyspace", "table", 
  keyColumns = SomeColumns("part1", "part2"))

This method works only with RDDs; if you use DataFrames, you just need to call df.rdd first. Also, in some versions of the connector you may need to restrict the selection to just the partition columns - see the discussion in this answer.
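Applied to the table from the question, where a alone is the partition key, a minimal sketch might look like the following; the keyspace name ks, the table name table, and the partition-key values are assumptions for illustration:

```scala
import com.datastax.spark.connector._

// Hypothetical: an RDD holding the partition-key values to drop,
// wrapped in Tuple1 since there is a single partition-key column.
val partitionsToDelete = sc.parallelize(Seq(Tuple1(1), Tuple1(2)))

// Restricting keyColumns to the partition key only makes the
// connector issue partition-level deletes (one partition tombstone
// per value) instead of per-row deletes.
partitionsToDelete.deleteFromCassandra("ks", "table",
  keyColumns = SomeColumns("a"))
```

If your source is a DataFrame of keys, select only the partition-key column(s) before converting with .rdd, so no clustering columns leak into the delete.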
