So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am using Spark-Cassandra Connector 3.0.0 to do a repartitionByCassandraReplica followed by a joinWithCassandraTable,
and then some SparkML analysis takes place. My question is: what eventually happens with the Spark partitions?
1st scenario
My partitionsPerHost parameter of repartitionByCassandraReplica is the number of selected Cassandra partition keys, which means if I choose 4 partition keys I get 4 partitions per host. This gives me 64 Spark partitions, because I have 16 hosts.
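The scenario-1 arithmetic can be sketched as follows (a minimal model; the 4 partition keys and 16 hosts are the numbers from my setup above):

```python
# Scenario 1: repartitionByCassandraReplica produces
# partitionsPerHost * number_of_hosts Spark partitions.
partitions_per_host = 4   # one per selected Cassandra partition key
num_hosts = 16            # nodes in the cluster (local DC)

spark_partitions = partitions_per_host * num_hosts
print(spark_partitions)   # 64
```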
2nd scenario
But, according to the Spark-Cassandra Connector documentation, information from the system.size_estimates table should be used to calculate the Spark partitions. For example, from my system.size_estimates:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
≈ 126.66 partitions
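The size-estimate calculation above can be reproduced as a quick sketch (values copied from my system.size_estimates; input.split.size_in_mb left at the connector's 64 MB default):

```python
# Scenario 2: estimate the table size from system.size_estimates, then divide
# by input.split.size_in_mb to get the target number of Spark partitions.
mean_partition_size_bytes = 24_416_287.87  # from system.size_estimates
number_of_partitions = 332                 # from system.size_estimates
input_split_size_in_mb = 64                # connector default

estimated_table_size_mb = (mean_partition_size_bytes / 1_000_000) * number_of_partitions
spark_partitions = estimated_table_size_mb / input_split_size_in_mb

print(estimated_table_size_mb)  # roughly 8106.2 MB
print(spark_partitions)         # roughly 126.66 partitions
```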
So, when does the 1st scenario take place, and when the 2nd? Am I calculating something wrong? Are there specific cases where the 1st scenario happens and others where the 2nd does?
CodePudding user response:
Those are two completely different paths by which the number of Spark partitions is calculated.
If you're calling repartitionByCassandraReplica(), the number of Spark partitions is determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.
Otherwise, the connector uses input.split.size_in_mb to determine the number of Spark partitions based on the estimated table size. Cheers!