For one of the data cleaning steps, I would like to gather insight into how the unique values are existing as a percentage of the total row count so that I can apply a threshold and decide if I should completely remove this column / feature. For this I came up with this function as below:
def uniqueValuesAsPercentage(data: DataFrame) = {
val (rowCount, columnCount) = shape(data)
data.selectExpr(data.head().getValuesMap[Long](data.columns).map(elem => {
val (columnName, uniqueCount) = elem
val percentage = uniqueCount / rowCount * 100
(columnName, uniqueCount, percentage)
}))
}
But it fails with the following error:
<console>:90: error: type mismatch;
found : scala.collection.immutable.Iterable[(String, Long, Long)]
required: String
data.selectExpr(data.head().getValuesMap[Long](data.columns).map(elem => {
Unfortunately since this is in the Apache Zeppelin notebook, I'm also missing the capabilities of an IDE. I have the IntelliJ untilate, but the Big Data Tools support seem not to be available for my version of the IDE. Very annoying!
Any ideas as to what the problem is here? I guess I'm messing with the DataFrame in the selectExpr(....). As it can be seen, that I'm returning a tuple with the information I calculate.
CodePudding user response:
You can calculate it in a much simpler way:
import org.apache.spark.sql.functions.{col, countDistinct, count}
import spark.implicits._
// define a dataframe for example
val data = Seq(("1", "1"), ("1", "2"), ("1", "3"), ("1", "4")).toDF("col_a", "col_b")
data.select(data.columns.map(c => (lit(100) * countDistinct(col(c)) / count(col(c))).alias(c)): _*).show()
// output:
----- -----
|col_a|col_b|
----- -----
| 25.0|100.0|
----- -----