I have a Spark DataFrame with approximately a million records. I'm trying to split it into multiple small DataFrames, each with a maximum row count of 20,000 (each should have exactly 20,000 rows except the last one, which may have fewer). Can you help me with this? Thank you.
CodePudding user response:
Ok, maybe not the most efficient way, but here it is. You can create a new column that numbers every row (in case you don't have a unique id column), then iterate over the whole DataFrame and select batches of 20,000 rows, adding each batch to a list of DataFrames. One caveat: monotonically_increasing_id() is increasing but not contiguous, so we convert it to a gap-free row number with row_number() over a window before filtering on ranges.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import spark.implicits._

val subsetSize = 20000
var index = 0
var listOfDF: List[DataFrame] = List()

// row_number() over an unpartitioned window yields contiguous 0-based ids;
// note this shuffles all rows into one partition for the numbering step,
// which is acceptable at ~1M rows. Skip this step if you already have a
// contiguous unique id per row.
val window = Window.orderBy(monotonically_increasing_id())
val df = spark.table("your_table")
  .withColumn("rowNum", row_number().over(window) - 1)

val totalRows = df.count()

def returnSubDF(fromIndex: Int, toIndex: Int) = {
  df.filter($"rowNum" >= fromIndex && $"rowNum" < toIndex)
}

while (index < totalRows) {
  listOfDF = listOfDF :+ returnSubDF(index, index + subsetSize)
  index += subsetSize
}
listOfDF.head.show()
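If the goal is to process each batch separately (for example, writing each one out on its own), you can then walk the list. A minimal sketch, assuming a writable output location (the directory below is just a placeholder, not from the original answer):

// Hypothetical usage: persist each ~20k-row batch to its own directory
// ("/tmp/batches" is a placeholder path; adjust for your environment)
listOfDF.zipWithIndex.foreach { case (batch, i) =>
  batch.write.mode("overwrite").parquet(s"/tmp/batches/batch_$i")
}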