The Spark Cassandra Connector has the RDD.deleteFromCassandra(keyspaceName, tableName) method. The values in the RDD are interpreted as primary key constraints.
I have a table like this:
CREATE TABLE table (a int, b int, c int, PRIMARY KEY (a,b));
As you can see, a is the partition key and b is the clustering key.
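To make the problem concrete, here is a minimal sketch of what I understand the default behaviour to be (the keyspace name ks is a placeholder): every tuple in the RDD is matched against the full primary key (a, b), so each pair produces a row-level delete and therefore a tombstone.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("delete-by-primary-key"))

// Default keyColumns is the full primary key (a, b): one row-level
// delete (and one tombstone) is issued per (a, b) pair.
sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))
  .deleteFromCassandra("ks", "table")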
I need a Spark app that deletes efficiently by partition key, not by the full primary key. My goal is to always drop entire partitions by their partition keys rather than create a tombstone for each primary key.
How can I do that with the Spark connector?
Thank you
CodePudding user response:
Yes, it's possible if you specify the keyColumns parameter to the .deleteFromCassandra function (docs). For example, if you have a composite partition key consisting of two columns, part1 and part2:
rdd.deleteFromCassandra("keyspace", "table",
  keyColumns = SomeColumns("part1", "part2"))
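Applied to the table from the question, where a alone is the partition key, a minimal sketch could look like this (reusing an existing SparkContext sc; the keyspace name ks is a placeholder). Wrapping each key value in Tuple1 lets the connector map it to the single column named in SomeColumns:
import com.datastax.spark.connector._

// Partition key values whose entire partitions should be dropped.
val partitionsToDrop = sc.parallelize(Seq(Tuple1(1), Tuple1(2), Tuple1(3)))

// Restricting keyColumns to the partition key issues one partition-level
// delete per value instead of a row-level tombstone per primary key.
partitionsToDrop.deleteFromCassandra("ks", "table", keyColumns = SomeColumns("a"))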
This method works only with RDDs; if you use DataFrames, you just need to call df.rdd first. Also, in some versions of the connector you may need to restrict the selection to just the partition key columns; see the discussion in this answer.
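For the DataFrame case, a sketch under the same assumptions (the DataFrame df and the keyspace name ks are placeholders) might be:
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-partitions").getOrCreate()

// Placeholder DataFrame holding the partition key values to drop.
val df = spark.createDataFrame(Seq(Tuple1(1), Tuple1(2))).toDF("a")

// Keep only the partition key column, convert to an RDD of tuples,
// and delete whole partitions by partition key.
df.select("a").rdd
  .map(row => Tuple1(row.getInt(0)))
  .deleteFromCassandra("ks", "table", keyColumns = SomeColumns("a"))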