Spark Cassandra and resource allocation


My understanding is that the default spark.cassandra.input.split.size_in_mb is 64 MB. That means the number of tasks created for reading data from Cassandra will be roughly approx_table_size / 64. Say the table size is 6400 MB (we simply read the data, do a foreachPartition, and write it back to a DB), so the number of tasks will be 100. But when running the job on YARN, if I explicitly set --num-executors 3 and --executor-cores 2, that should allow at most 6 tasks to run concurrently. Now, will that configuration override the input.split.size value of 100 tasks? Or will 100 tasks be created while reading the data, after which the partitions are reduced to 6 and a data shuffle takes place?

CodePudding user response:

First thing to mention: spark.cassandra.input.split.size_in_mb is only a default target value. If a single Cassandra partition is larger than that value, the resulting Spark partition will have the size of that Cassandra partition, not the size from the setting, because a split never subdivides a Cassandra partition.
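The partition-count arithmetic the question assumes can be sketched as below. This is not the connector's source code, just a simplified model of it; the helper name is mine, and the real connector estimates table size from Cassandra's system tables:

```python
import math

def approx_num_partitions(table_size_mb: float, split_size_mb: float = 64) -> int:
    """Approximate how many Spark partitions the connector creates.

    Simplified model: total size divided by the split-size target.
    In reality a split never cuts a single Cassandra partition, so an
    oversized Cassandra partition produces one larger Spark partition.
    """
    return max(1, math.ceil(table_size_mb / split_size_mb))

# 6400 MB table with the default 64 MB split size -> about 100 partitions
print(approx_num_partitions(6400))  # -> 100
```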

Regarding processing - not really. The Spark Cassandra Connector will create 100 Spark partitions, which Spark will process using the available cores (6). Since each partition is handled by a single core, the 100 tasks run in 17 waves: 16 full waves of 6 tasks plus a final wave of 4 (6 × 16 + 4 = 100). Shuffling will happen only if you call .repartition explicitly.
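The scheduling arithmetic above can be checked with a short sketch (numbers taken from the example; the variable names are mine):

```python
import math

num_partitions = 100   # Spark partitions created by the connector
total_cores = 3 * 2    # --num-executors 3, --executor-cores 2

# Each partition is processed by exactly one core, so the 100 tasks run
# in "waves" of at most 6 concurrent tasks.
waves = math.ceil(num_partitions / total_cores)
last_wave = num_partitions % total_cores or total_cores

print(waves)      # -> 17 (16 full waves of 6 tasks, then a final wave)
print(last_wave)  # -> 4  (tasks in the final, partial wave)
```

No shuffle is involved in any of this: the wave scheduling is just the task scheduler reusing cores, not a repartitioning of the data.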
