Imagine I have two different dataframes with similar schemas:
df0.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
and:
df1.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: double (nullable = false)
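For reference, two dataframes with these schemas can be created like this (the sample values are arbitrary placeholders; only the resulting schemas matter):
import spark.implicits._

// arbitrary sample rows; only the schemas are relevant here
val df0 = Seq((1, 2)).toDF("single", "double")
val df1 = Seq((3, 4, 5.0)).toDF("newColumn", "single", "double")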
Now I merge these two schemas as shown below and create a new dataframe with the merged schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val consolidatedSchema = df0.schema.++:(df1.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- double: double (nullable = false)
But as you can see, I now have two fields named double, each with a different data type.
How can I keep the field whose data type matches the one in the df0
schema and drop the other one?
I want the final schema to be like this:
finalDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
I would also appreciate it if you could suggest any other method to merge these two schemas and reach my goal.
Thank you in advance.
CodePudding user response:
You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:
import org.apache.spark.sql.types.StructType

val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    // keep only the df1 fields whose names do not already appear in df0
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
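Creating the empty dataframe from this schema as before then yields a single double field with the integer type from df0 (a sketch, assuming df0 and df1 are in scope as above; note that the fields of df0 now come first):
import org.apache.spark.sql.Row

val finalDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
finalDF.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- newColumn: integer (nullable = false)
If the original column order matters, you can reorder the resulting fields afterwards; the data types themselves are now consistent.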