Spark returning a lot of duplicates using Scylladb as a source

On Spark 2.4.4 with spark-cassandra-connector 2.5.1, retrieving data from a materialized view returns a lot of duplicates. The code is very simple:

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table1", "keyspace" -> "test")).load()
df.select("id").filter("date = 20211215")

but if I run this code (changing only the filter to a range):

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table1", "keyspace" -> "test")).load()
df.select("id").filter("date >= 20211215 and date < 20211216")

the result is correct, with no duplicates.
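
As a rough check, the duplication can be quantified by comparing the raw row count against the count of distinct primary keys. A minimal sketch (column names taken from the schema below), reusing the df defined above:

// Sketch: rows returned for the equality filter vs. distinct primary keys
val rows = df.filter("date = 20211215").select("id", "some_other_id")
val total = rows.count()                                          // rows as returned by the connector
val unique = rows.dropDuplicates("id", "some_other_id").count()   // one row per primary key
println(s"total=$total unique=$unique duplicates=${total - unique}")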

In my environment, Cassandra has this table:

CREATE TABLE IF NOT EXISTS table1 (
    id text,
    some_other_id text,
    date int,
    PRIMARY KEY (id, some_other_id)
);

CREATE INDEX table_by_date ON table1 (date);
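
For reference, comparing the physical plans of the two queries shows which predicates the connector actually pushes down to the database; a minimal sketch reusing the df from above:

// Sketch: inspect the pushed-down filters for each predicate shape
df.select("id").filter("date = 20211215").explain()
df.select("id").filter("date >= 20211215 and date < 20211216").explain()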

Do you know what the root cause is? Am I missing some configuration?

CodePudding user response:

Can you please share which Scylla version you're using (the output of scylla --version)? There's a good chance that your issue is the same issue reported here.

There's already a fix for this, but it's currently only available in Scylla OSS 4.6 (not yet GA). 4.6 RC1 was promoted earlier today.
