How to read parquet files using only one thread on a worker/task node?

In Spark, if we execute the following command:

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5,false)

Spark distributes the read across all threads on a worker/task node. How do we execute this command and limit it to just one thread? Is this even possible?
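For reference, here is a minimal sketch (using the same file path as above) of how one might check how many partitions, and therefore parallel tasks, Spark plans for the scan; the variable name df is just for illustration:

val df = spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
// each partition of the underlying RDD becomes one read task
println(df.rdd.getNumPartitions)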

CodePudding user response:

If you want to do this for the whole Spark session, you can limit both the shuffle partitions (the number of partitions Spark uses when shuffling data for joins and aggregations) and the default parallelism (the default number of partitions in RDDs returned by transformations) to 1:

// use a single partition when shuffling data for joins and aggregations
spark.conf.set("spark.sql.shuffle.partitions", 1)
// use a single partition by default for RDDs returned by transformations
spark.conf.set("spark.default.parallelism", 1)
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5, false)

Otherwise, you can repartition the DataFrame into a single partition before calling an action:

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .repartition(1)
  .show(5,false)