Comparing Two DataFrame in Spark and Scala


I have a DataFrame loaded from a CSV file, and I'm trying out some data-cleaning activities on it. One of the tasks is to measure each column's uniqueness as a percentage of the total number of rows; if that percentage is below a certain threshold, I want to drop the column. Here is what I have so far:

import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

val df = spark.read
  .format("csv")
  .option("delimiter", ";")
  .option("header", "true") //first line in file has headers
  //.option("mode", "DROPMALFORMED")
  .load("work/youtube_videos.csv")

// per-column uniqueness as a percentage: 100 * countDistinct / count
val df2 = df.select(df.columns.map(c => (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*)

// pull the single row of percentages into a Map and keep the names of the columns below the threshold
val colsToDrop = df2.first().getValuesMap[Double](df2.columns).filter(_._2 < 1.0).keys.toArray
colsToDrop foreach println

So now I have the original DataFrame plus the colsToDrop array, which identifies the columns to get rid of. I can apply this colsToDrop array to the original DataFrame to remove those columns. Is this a good idea / approach? Is there a more efficient approach?
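The threshold-filtering step itself is plain Scala once the percentages row is collected as a Map. A minimal sketch of just that step, with made-up column names and values for illustration:

```scala
// Hypothetical per-column uniqueness percentages, shaped like the Map
// returned by df2.first().getValuesMap[Double](df2.columns)
val uniqueness: Map[String, Double] =
  Map("video_id" -> 100.0, "title" -> 97.5, "country" -> 0.02)

val threshold = 1.0 // drop columns whose distinct ratio is below 1%
val colsToDrop: Array[String] =
  uniqueness.filter { case (_, pct) => pct < threshold }.keys.toArray
// only "country" falls below the threshold here
```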

CodePudding user response:

This is what I do currently to get what I want:

val newDF = df.drop(colsToDrop:_*)
newDF.show(false)
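On efficiency: computing an exact countDistinct for every column in one job can be expensive on wide or large data. If exact counts aren't required, Spark's approx_count_distinct (HyperLogLog-based, in org.apache.spark.sql.functions) is usually much cheaper and still fine for a threshold test. A sketch under that assumption, reusing df from the question:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, col, count, lit}

// Same idea as df2 above, but with approximate distinct counts,
// which avoids an exact distinct aggregation per column.
val stats = df.select(df.columns.map(c =>
  (lit(100) * approx_count_distinct(col(c)) / count(col(c))).alias(c)): _*)

val colsToDrop = stats.first().getValuesMap[Double](stats.columns)
  .filter { case (_, pct) => pct < 1.0 }
  .keys.toSeq

val newDF = df.drop(colsToDrop: _*)
```

Either way, the overall shape — one aggregation pass to get the stats, then a single drop on the original DataFrame — is a reasonable approach.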