Scala - splitting dataframe based on number of rows


I have a Spark DataFrame with approximately a million records. I'm trying to split it into multiple smaller DataFrames, each with at most 20,000 rows (every one should have exactly 20,000 rows except the last, which may have fewer). Can you help me with this? Thank you.

CodePudding user response:

Ok, maybe not the most efficient way, but here it is. You can create a new column that numbers every row (in case you don't have a consecutive unique id column already). We then iterate over the whole range of row numbers, selecting batches of 20,000 rows and adding each one to a list of DataFrames.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import spark.implicits._

var index = 0L
val subsetSize = 20000
var listOfDF: List[DataFrame] = List()

// The withColumn is optional if you already have a consecutive unique id per row.
// Note: monotonically_increasing_id() is NOT guaranteed to be consecutive, which
// would break the range filters below, so row_number() over a window is used
// instead to get a gap-free, zero-based index. (Ordering the window by a constant
// pushes all rows through one partition, which is part of why this isn't the most
// efficient approach.)
val df = spark.table("your_table")
  .withColumn("rowNum", row_number().over(Window.orderBy(lit(1))) - 1)

val totalRows = df.count()

def returnSubDF(fromIndex: Long, toIndex: Long): DataFrame =
  df.filter($"rowNum" >= fromIndex && $"rowNum" < toIndex)

while (index < totalRows) {
  listOfDF = listOfDF :+ returnSubDF(index, index + subsetSize)
  index += subsetSize
}

listOfDF.head.show()
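
If pushing the whole table through a single window partition is too slow at this scale, a common alternative is to assign the contiguous index with the RDD API's zipWithIndex, which is gap-free without a global sort. A minimal sketch, assuming the same spark session and the same placeholder table name "your_table":

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val source = spark.table("your_table")
val schema = StructType(source.schema.fields :+ StructField("rowNum", LongType, nullable = false))

// zipWithIndex assigns zero-based, consecutive indices across all partitions.
val indexed = spark.createDataFrame(
  source.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  schema
).cache() // each batch filter rescans `indexed`, so caching avoids recomputation

val batchSize = 20000L
val numBatches = math.ceil(indexed.count().toDouble / batchSize).toLong

val batches: Seq[DataFrame] = (0L until numBatches).map { b =>
  indexed.filter(indexed("rowNum") >= b * batchSize && indexed("rowNum") < (b + 1) * batchSize)
}

Either way, each of the resulting DataFrames is still lazy; nothing is materialized until you call an action (show, count, write, ...) on a given batch.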