I have a DataFrame loaded from a CSV file and I'm trying out some data cleaning activities on it. One of the tasks is to compute each column's uniqueness as a percentage of the total number of rows and, if that percentage is less than a certain threshold, drop those columns. Here is what I have so far:
val df = spark.read
.format("csv")
.option("delimiter", ";")
.option("header", "true") //first line in file has headers
//.option("mode", "DROPMALFORMED")
.load("work/youtube_videos.csv")
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

// Uniqueness per column: 100 * distinct values / non-null values.
val df2 = df.select(df.columns.map(c => (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*)
// Names of the columns whose uniqueness falls below the 1% threshold.
val colsToDrop = df2.selectExpr(df2.first().getValuesMap[Double](df2.columns).filter(elem => elem._2 < 1.0).keys.toSeq: _*).columns
colsToDrop foreach println
So now I have two DataFrames: the original one, and df2, from which I derived colsToDrop, the array of column names I want to get rid of. I can use this colsToDrop array on the original DataFrame to drop them. Is this a good idea / approach? Is there a more efficient approach?
CodePudding user response:
This is what I do currently to get what I want:
val newDF = df.drop(colsToDrop:_*)
newDF.show(false)
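For completeness, here is a minimal self-contained sketch of the whole flow (the question's computation plus the drop above), assuming a local SparkSession, the same work/youtube_videos.csv path, and a hard-coded 1% threshold; treat the object name and the local master setting as placeholders for your own setup.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

object DropLowUniquenessColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drop-low-uniqueness-columns")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = spark.read
      .format("csv")
      .option("delimiter", ";")
      .option("header", "true")
      .load("work/youtube_videos.csv")

    // One aggregation pass: uniqueness percentage per column
    // (distinct values divided by non-null values, times 100).
    val uniquenessRow = df
      .select(df.columns.map(c =>
        (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*)
      .first()

    // Column names whose uniqueness is below the 1% threshold.
    val colsToDrop = uniquenessRow
      .getValuesMap[Double](df.columns)
      .filter { case (_, pct) => pct < 1.0 }
      .keys
      .toSeq

    // Drop them from the original DataFrame.
    val newDF = df.drop(colsToDrop: _*)
    newDF.show(false)

    spark.stop()
  }
}

Note that reading the names straight out of getValuesMap avoids the extra selectExpr(...).columns round-trip from the question, but the overall result is the same.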