I need to process Spark DataFrame partitions in batches, N partitions at a time. For example, if I have 1000 partitions in a Hive table, I need to process 100 partitions at a time.
I tried the following approach:
Get the partition list from the Hive table and find the total count.
Get the loop count using total_count/100.
Then
for x in range(loop_count):
    files_list = partition_path_list[start_index:end_index]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
But this is not working as expected. Can anyone suggest a better method? A solution in Spark Scala is preferred.
CodePudding user response:
The for loop you have only increments x on each iteration; start_index and end_index are never updated, so every pass reads the same slice. Not sure why you mention Scala since your code is in Python. Here's an example assuming 1000 partitions in total.
partitions_per_iteration = 100
total_partitions = 1000  # total number of partition paths, e.g. len(partition_path_list)

for start_index in range(0, total_partitions, partitions_per_iteration):
    # take the next batch of at most 100 partition paths
    files_list = partition_path_list[start_index:start_index + partitions_per_iteration]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
    # process df for this batch of partitions here
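Since a Scala solution was asked for, here is a rough equivalent sketch (untested); it assumes partitionPathList is a Seq[String] of partition paths and targetTableLocation holds the table's base path, both built elsewhere, and uses grouped to slice the list into batches:

// Rough Scala sketch, assuming partitionPathList: Seq[String] and
// targetTableLocation: String are already available.
val partitionsPerIteration = 100

partitionPathList.grouped(partitionsPerIteration).foreach { filesList =>
  val df = spark.read
    .option("basePath", targetTableLocation)
    .parquet(filesList: _*)
  // process df for this batch of partitions
}

grouped yields chunks of at most 100 paths, so the last batch can be smaller, just like the Python slice.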