I have two dataframes, and I want to add to the first all the columns that are in the second but not in the first. I have an array of StructField columns that I want to add to the first dataframe and fill with nulls.
That's the best I've come up with:
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
  val spark = df.sparkSession
  val schema = new StructType(df.schema.toArray ++ columnsToAdd)
  spark.createDataFrame(df.rdd, schema)
}
Is there any better way?
CodePudding user response:
The solution I gave in the question unfortunately does not work: it crashes with java.lang.ArrayIndexOutOfBoundsException. As I understand it, even though I added the columns to the schema, no corresponding data was added to the dataframe, so Spark tries to read a field that exists in the schema but not in the actual rows.
I wrote the following variant; it uses recursion and does what I want. Ideally, though, I would like to avoid using null and somehow replace it with None.
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructField

@tailrec
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField], indx: Int): DataFrame = {
  if (columnsToAdd.isEmpty || indx == columnsToAdd.length) df
  else {
    // Add the next missing column as a typed null literal, then recurse.
    val dfWithColumn = df.withColumn(columnsToAdd(indx).name, lit(null).cast(columnsToAdd(indx).dataType))
    addColumns(dfWithColumn, columnsToAdd, indx + 1)
  }
}
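The same effect can be written without an explicit index by folding over the array of fields with foldLeft; this is a sketch equivalent to the recursive version above (the object name AddMissingColumns is just for illustration):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructField

object AddMissingColumns {
  // Fold over the missing fields, adding each one as a null
  // literal cast to that field's declared data type.
  def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame =
    columnsToAdd.foldLeft(df) { (acc, field) =>
      acc.withColumn(field.name, lit(null).cast(field.dataType))
    }
}
```

This avoids passing an index around and needs no @tailrec annotation, since foldLeft handles the iteration.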
Also this answer helped a lot.