I have a Spark DataFrame with approximately a million records. I'm trying to split it into multiple small DataFrames, each with a maximum row count of 20,000 (each should have exactly 20,000 rows except the last one, which may have fewer). Can you help me with this? Thank you.
CodePudding user response:
Ok, maybe not the most efficient way, but here it is. You can create a new column that numbers every row (in case you don't have a unique id column), then iterate over the whole DataFrame and select batches of 20,000 rows, adding each batch to a list of DataFrames. One caveat: monotonically_increasing_id() is increasing but not contiguous, so we convert it to a gap-free row number with row_number() over a window before filtering on ranges.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import spark.implicits._

val subsetSize = 20000
var index = 0
var listOfDF: List[DataFrame] = List()

// row_number() over an unpartitioned window yields contiguous 0-based ids;
// note this shuffles all rows into one partition for the numbering step,
// which is acceptable at ~1M rows. Skip this step if you already have a
// contiguous unique id per row.
val window = Window.orderBy(monotonically_increasing_id())
val df = spark.table("your_table")
  .withColumn("rowNum", row_number().over(window) - 1)

val totalRows = df.count()

def returnSubDF(fromIndex: Int, toIndex: Int) = {
  df.filter($"rowNum" >= fromIndex && $"rowNum" < toIndex)
}

while (index < totalRows) {
  listOfDF = listOfDF :+ returnSubDF(index, index + subsetSize)
  index += subsetSize
}
listOfDF.head.show()
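If the goal is to process each batch separately (for example, writing each one out on its own), you can then walk the list. A minimal sketch, assuming a writable output location (the directory below is just a placeholder, not from the original answer):

// Hypothetical usage: persist each ~20k-row batch to its own directory
// ("/tmp/batches" is a placeholder path; adjust for your environment)
listOfDF.zipWithIndex.foreach { case (batch, i) =>
  batch.write.mode("overwrite").parquet(s"/tmp/batches/batch_$i")
}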