import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._
case class Something(id: Int, batchId: Option[String], div: String)
val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")
val ds = Seq(sth1, sth2, sth3, sth4).toDS()
ds.write.mode("overwrite").option("path", "local_path").bucketBy(3, "id").saveAsTable("Tmp")
I went to local_path where the data is stored, but I only found two parquet files. I wonder why it doesn't create 3 parquet files, which is the number of buckets.
I have also tried bucket numbers of 1 and 2, and in those cases the bucket number does match the number of parquet files stored in the local path: when the bucket number is 1 there is only 1 parquet file, and similarly when it equals 2.
CodePudding user response:
bucketBy is probably not what you're looking for if you expect your data to be written into exactly 3 parquet files. When you use bucketBy, you define the column names, and a hash function is responsible for dividing your data into the number of buckets you specified; that doesn't necessarily mean the data will be saved in n files. In your case the four id values happen to hash into only two of the three buckets, so only two files are written. Bucketing is used to boost query performance (something similar to indexing, but not the same). I haven't tried this myself, but what you're probably looking for is the repartition method.
import org.apache.spark.sql.SaveMode

// repartition(3) forces exactly 3 partitions, so 3 parquet files are written
ds.repartition(3)
  .write.mode(SaveMode.Overwrite)
  .option("path", "local_path")
  .saveAsTable("Tmp")
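To see why bucketBy alone produced only two files, you can compute the bucket each row hashes to. The sketch below assumes that functions.hash (Murmur3) matches the hash bucketBy uses, which holds for current Spark versions; the "bucket" column name is just for illustration:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Show which of the 3 buckets each row would land in
ds.withColumn("bucket", pmod(hash(col("id")), lit(3))).show()
// If the four ids fall into only two distinct buckets, one bucket is empty
// and only two files appear on disk.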
CodePudding user response:
You should use repartition, and you will get a number of output files equal to the number of partitions you defined. You can still combine bucketBy with repartition, but bucketBy has a different purpose: to increase join performance.
// repartition(3) before the bucketed write
ds.repartition(3)
  .write.mode("overwrite")
  .option("path", "local_path")
  .bucketBy(3, "id")
  .saveAsTable("Tmp")
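As a sketch of the join benefit, assuming a second table (hypothetically named Tmp2) saved with the same bucketBy(3, "id") spec, joining two tables with matching bucket specs lets Spark skip the shuffle:

// Join two tables bucketed on the same column and bucket count
val joined = spark.table("Tmp").join(spark.table("Tmp2"), "id")
joined.explain()  // the plan should show no Exchange (shuffle) before the join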
Here is a link to my blog, where you may find other Spark topics that interest you: https://bigdata-etl.com/articles/big-data/apache-spark/