How to drop duplicate columns based on another schema in spark scala?


Imagine I have two different dataframes with similar schemas:

df0.printSchema
root
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)

and:

df1.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: double (nullable = false)

Now I merge these two schemas as shown below and create a new dataframe with the merged schema:

val consolidatedSchema = (df0.schema ++ df1.schema).toSet

val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)

val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)

emptyDF.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)
 |-- double: double (nullable = false)

But as you can see, there are two fields named double, with different data types. How can I keep the one whose data type matches the field in the df0 schema and drop the other one?

I want the final schema to be like this:

finalDF.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)

I would really appreciate it if you could suggest any other method to merge these two schemas and reach my goal.

Thank you in advance.

CodePudding user response:

You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:

val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
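To see the dedup-by-name idea in isolation, here is a minimal sketch that models each schema as a plain sequence of (name, type) pairs instead of a Spark StructType, so it runs without a Spark dependency. The names `SchemaMergeSketch`, `mergeByName`, `schema0`, and `schema1` are illustrative, not part of any Spark API.

```scala
// A simplified stand-in for schema merging: each "schema" is a Seq of
// (fieldName, typeName) pairs rather than a Spark StructType.
object SchemaMergeSketch {
  type Field = (String, String) // (name, type)

  // Keep every field of the first schema, then append only those fields
  // of the second schema whose names are not already present in the first.
  // This mirrors the StructType filter above: on a name clash, the
  // first schema's type wins.
  def mergeByName(s0: Seq[Field], s1: Seq[Field]): Seq[Field] =
    s0 ++ s1.filter { case (name, _) => !s0.exists(_._1 == name) }

  def main(args: Array[String]): Unit = {
    val schema0 = Seq("single" -> "integer", "double" -> "integer")
    val schema1 = Seq("newColumn" -> "integer",
                      "single"    -> "integer",
                      "double"    -> "double")
    // "double" keeps the integer type from schema0; "newColumn" is appended.
    println(mergeByName(schema0, schema1))
  }
}
```

Note that with this approach the field order follows df0 first (single, double, newColumn); if you need a different column order, reorder the resulting fields afterwards.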