Imagine I have two different dataframes with similar schemas:
df0.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
and:
df1.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: double (nullable = false)
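For reference, two dataframes with these schemas can be created like this (the sample values are arbitrary placeholders; only the resulting schemas matter):
import spark.implicits._

// arbitrary sample rows; only the schemas are relevant here
val df0 = Seq((1, 2)).toDF("single", "double")
val df1 = Seq((3, 4, 5.0)).toDF("newColumn", "single", "double")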
Now I merge these two schemas as shown below and create a new dataframe with the merged schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val consolidatedSchema = df0.schema.++:(df1.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- double: double (nullable = false)
But as you can see, I now have two fields named double, each with a different data type.
How can I keep the field whose data type matches the one in the df0
schema and drop the other one?
I want the final schema to be like this:
finalDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
I would also appreciate it if you could suggest any other method to merge these two schemas and reach my goal.
Thank you in advance.
CodePudding user response:
You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:
import org.apache.spark.sql.types.StructType

val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    // keep only the df1 fields whose names do not already appear in df0
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
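Creating the empty dataframe from this schema as before then yields a single double field with the integer type from df0 (a sketch, assuming df0 and df1 are in scope as above; note that the fields of df0 now come first):
import org.apache.spark.sql.Row

val finalDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
finalDF.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- newColumn: integer (nullable = false)
If the original column order matters, you can reorder the resulting fields afterwards; the data types themselves are now consistent.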