On Spark 2.4.4 with spark-cassandra-connector 2.5.1, retrieving data from a materialized view returns a lot of duplicate rows. The code is very simple:
val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "table1", "keyspace" -> "test" )).load()
df.select("id").filter("date=20211215")
but if I run this code (changing only the filter condition):
val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "table1", "keyspace" -> "test" )).load()
df.select("id").filter("date>=20211215 and date<20211216")
the result is correct, without duplicates.
In my environment, Cassandra has the following table:
CREATE TABLE IF NOT EXISTS table (
    id text,
    some_other_id text,
    date int,
    PRIMARY KEY (id, some_other_id)
);
CREATE INDEX table_by_data ON table(date);
Do you know what the root cause might be? Am I missing some configuration?
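One way to narrow this down is to check whether the duplication also appears server-side, outside of Spark. A minimal sketch, assuming the keyspace and table names from the Spark code above (`test.table1`) and that the secondary index on `date` exists:

```sql
-- Run in cqlsh; if this also returns duplicate ids, the problem is in
-- the database/index path rather than in the Spark connector.
SELECT id FROM test.table1 WHERE date = 20211215;
```

If the CQL query returns distinct rows but the Spark job does not, the issue is more likely in how the connector pushes down the equality predicate against the index.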
CodePudding user response:
Can you please share which Scylla version you're using? (the output of scylla --version)
There's a good chance that your issue is the same one reported here.
There's already a fix for this, but it's currently only available in Scylla OSS 4.6 (not yet GA). 4.6 RC1 was promoted earlier today.