I have 2 dataframes: the first has 53 columns and the second has 132 columns. I want to compare them, remove every column that is not common to both, and then display each dataframe with only the common columns.
So far I have built a list of all the columns that don't match, but I don't know how to drop them.
val diffColumns = df2.columns.toSet.diff(df1.columns.toSet).union(df1.columns.toSet.diff(df2.columns.toSet))
This gives me a scala.collection.immutable.Set[String]. Now I'd like to use it to drop these columns from each dataframe. Something like this, but it does not work:
val newDF1 = df1.drop(diffColumns)
CodePudding user response:
The .drop function accepts column names as a variable-length argument list, not a Set object, so you need to convert the set to a Seq and "expand" it using the : _* syntax, like this:
df1.drop(diffColumns.toSeq: _*)
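To see what the : _* expansion does without needing a Spark session, here is a minimal plain-Scala sketch. The dropNames helper and the column names are hypothetical stand-ins for a varargs method such as drop; they are not part of Spark.

```scala
// Hypothetical stand-in for a varargs method like DataFrame.drop(colNames: String*).
// It returns the names in `all` that are not listed in `toDrop`.
def dropNames(all: Seq[String], toDrop: String*): Seq[String] =
  all.filterNot(name => toDrop.contains(name))

// Made-up set of column names, shaped like diffColumns in the question.
val diffColumns: Set[String] = Set("b", "c")

// Passing the Set directly would not compile, because a Set[String] is not a String.
// Converting to Seq and adding `: _*` passes each element as a separate argument.
val kept = dropNames(Seq("id", "b"), diffColumns.toSeq: _*)
```

The same pattern applies to any method declared with a repeated parameter such as String*.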
Also, instead of computing the difference, it may be easier to use intersect to find the common columns and then use .select on each dataframe to keep only those columns:
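import org.apache.spark.sql.functions.{col, rand}

val df = spark.range(10).withColumn("b", rand())
val df2 = spark.range(10).withColumn("c", rand())
// Columns present in both dataframes (here only "id"), wrapped as Column objects
val commonCols = df.columns.toSet.intersect(df2.columns.toSet).toSeq.map(col)
df.select(commonCols: _*)
df2.select(commonCols: _*)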
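One thing to keep in mind with the Set-based approach is that converting a Set to a Seq does not promise to keep the original column order. If the order matters, a possible variation is to filter one dataframe's column list by membership in the other. The sketch below works on plain arrays of made-up column names (standing in for df1.columns and df2.columns), so it runs without Spark.

```scala
// Made-up column names, as they might come back from df1.columns and df2.columns.
val cols1 = Array("id", "name", "b", "price")
val cols2 = Array("price", "c", "id")

// Build a Set from the second list for fast membership checks.
val inSecond = cols2.toSet

// Filter the first list, so the result keeps the first dataframe's column order.
val commonCols = cols1.filter(c => inSecond.contains(c))

// With Spark and org.apache.spark.sql.functions.col in scope, this could be used as:
//   df1.select(commonCols.map(col): _*)
//   df2.select(commonCols.map(col): _*)
```

Both dataframes then end up with the same columns in the same order, which helps if they are later combined by position (for example with union).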