Spark scala how to remove the columns that are not in common between 2 dataframes

Time:04-30

I have 2 dataframes: the first one has 53 columns and the second one has 132 columns. I want to compare them, remove all the columns that are not common to both, and then display each dataframe containing only the common columns.

What I have done so far is get a set of all the columns that don't match, but I don't know how to drop them.

    val diffColumns = df2.columns.toSet.diff(df1.columns.toSet).union(df1.columns.toSet.diff(df2.columns.toSet))

This gives me a scala.collection.immutable.Set[String]. Now I'd like to use it to drop these columns from each dataframe. Something like this, but it does not work:

    val newDF1 = df1.drop(diffColumns)

CodePudding user response:

The .drop function accepts column names as varargs (String*), not a Set object, so you need to convert the set to a Seq and "expand it" using the : _* syntax, like this:

    df1.drop(diffColumns.toSeq: _*)
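Applied to both dataframes from the question, the drop approach could be sketched as follows. The object and helper name `dropNonCommon` are hypothetical and not part of Spark; the sketch only assumes two dataframes like `df1` and `df2` above.

```scala
import org.apache.spark.sql.DataFrame

object DropNonCommon {
  // Hypothetical helper (not a Spark API): drops every column that is
  // present in only one of the two DataFrames.
  def dropNonCommon(df1: DataFrame, df2: DataFrame): (DataFrame, DataFrame) = {
    val cols1 = df1.columns.toSet
    val cols2 = df2.columns.toSet
    // Symmetric difference: names found in exactly one DataFrame.
    val diffColumns = cols1.diff(cols2).union(cols2.diff(cols1)).toSeq
    // drop(colNames: String*) is a no-op for names missing from the schema,
    // so the same sequence can be passed to both DataFrames.
    (df1.drop(diffColumns: _*), df2.drop(diffColumns: _*))
  }
}
```

It would be called as `val (newDF1, newDF2) = DropNonCommon.dropNonCommon(df1, df2)`.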

Also, instead of generating the diff, it may be easier to use intersect to find the common columns, and then use .select on each dataframe to get the same columns:

    import org.apache.spark.sql.functions.{col, rand}

    val df = spark.range(10).withColumn("b", rand())
    val df2 = spark.range(10).withColumn("c", rand())
    // Columns present in both dataframes (here only "id")
    val commonCols = df.columns.toSet.intersect(df2.columns.toSet).toSeq.map(col)
    df.select(commonCols: _*)
    df2.select(commonCols: _*)
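One caveat with going through a Set: a Scala Set does not guarantee iteration order, so with wide dataframes (such as the 53 and 132 columns in the question) the selected columns may come out in a different order than in the original. A minimal sketch that keeps the order of the first dataframe is shown below; the object `CommonColumns` and helper `keepCommon` are hypothetical names, and the local SparkSession is only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, rand}

object CommonColumns {
  // Hypothetical helper (not a Spark API): keeps only the columns present in
  // both DataFrames, in the order they appear in `left`.
  def keepCommon(left: DataFrame, right: DataFrame): (DataFrame, DataFrame) = {
    val rightNames = right.columns.toSet
    // Filtering the Array keeps the original column order of `left`.
    val common = left.columns.filter(rightNames.contains).map(col)
    (left.select(common: _*), right.select(common: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("common-columns").getOrCreate()
    val df1 = spark.range(10).withColumn("b", rand())
    val df2 = spark.range(10).withColumn("c", rand())
    val (newDF1, newDF2) = keepCommon(df1, df2)
    newDF1.printSchema()
    newDF2.printSchema()
    spark.stop()
  }
}
```

Note that the name comparison here is a plain, case-sensitive string match.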