Drop null column in a spark dataframe and print column name


I have this DataFrame:

+------+----------+-----------+
|brand |Timestamp |Weight     |
+------+----------+-----------+
|BR1   |1632899456|null       |
|BR1   |1632901256|null       |
|BR300 |1632901796|null       |
|BR300 |1632899155|null       |
|BR200 |1632899155|null       |
+------+----------+-----------+

And this list, which contains the names of the columns:

val column_names : Seq[String] = Seq("brand", "Timestamp", "Weight")

I would like to go through this list, check whether the corresponding column contains only null values, drop the column if that is the case, and log a message containing the name of the dropped column.

In this case, the result would be:

+------+----------+
|brand |Timestamp |
+------+----------+
|BR1   |1632899456|
|BR1   |1632901256|
|BR300 |1632901796|
|BR300 |1632899155|
|BR200 |1632899155|
+------+----------+

"THE COLUMN WEIGHT WAS DROPPED, IT CONTAINS ONLY NULL VALUES"

I am using Spark version 3.2.1 and SQLContext, with the Scala language.

CodePudding user response:

You can use Dataset.summary, which returns a DataFrame with statistics about every column. Then use this DataFrame to find the columns that contain only null values (their min and max are both null), and drop those columns from the original DataFrame.

Example:

case class Test(field1: String, field2: String)
val df = List(Test("1",null), Test("2",null), Test("3",null)).toDF("field1", "field2")

scala> df.show()
+------+------+
|field1|field2|
+------+------+
|     1|  null|
|     2|  null|
|     3|  null|
+------+------+

scala> df.summary("mean", "min", "max").show()
+-------+------+------+
|summary|field1|field2|
+-------+------+------+
|   mean|   2.0|  null|
|    min|     1|  null|
|    max|     3|  null|
+-------+------+------+
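
To complete the idea, here is a minimal sketch (my addition, not part of the original answer) that uses the "max" row of summary() to find the all-null columns, logs them, and drops them. It assumes the columns are of numeric or string type, since summary() only reports statistics for those, and the helper name dropAllNullColumns is illustrative.

import org.apache.spark.sql.DataFrame

// Illustrative helper: drop every column whose values are all null,
// based on the "max" row of Dataset.summary (the max of an all-null column is null).
def dropAllNullColumns(df: DataFrame): DataFrame = {
  val stats   = df.summary("max").drop("summary") // single row: max per column, as strings
  val maxRow  = stats.first()
  val columns = stats.columns

  // a column whose max is null contains only nulls
  val nullOnlyColumns = columns.zipWithIndex.collect {
    case (name, idx) if maxRow.isNullAt(idx) => name
  }

  nullOnlyColumns.foreach { name =>
    println(s"THE COLUMN ${name.toUpperCase} WAS DROPPED, IT CONTAINS ONLY NULL VALUES")
  }

  df.drop(nullOnlyColumns: _*)
}

With the example above, dropAllNullColumns(df) would print the message for field2 and return a DataFrame containing only field1.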

CodePudding user response:

The names of all-null columns can be found with the "min" function, since the minimum of a column that contains only nulls is null. Those names can then be logged and the columns dropped:

import org.apache.spark.sql.functions.min
import spark.implicits._ // assumes a SparkSession named `spark`; not needed in spark-shell

val column_names = Seq("brand", "Timestamp", "Weight")
// give the nulls an explicit type so the Timestamp column is inferred as a string column
val df = List(("1", null.asInstanceOf[String], 1), ("2", null.asInstanceOf[String], 2), ("3", null.asInstanceOf[String], 3))
  .toDF("brand", "Timestamp", "Weight")

// the min of an all-null column is null, so a single aggregated row identifies those columns
val minColumns = column_names.map(name => min(name).alias(name))
val minValuesRow = df.select(minColumns: _*).first

val nullColumnNames = column_names
  .zipWithIndex
  .filter { case (_, index) => minValuesRow.isNullAt(index) }
  .map(_._1)
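
To finish the task as described in the question, a small follow-up (my addition, not part of the original answer) can log each detected name and drop the columns; the variable name dfWithoutNullColumns is illustrative.

// Log each all-null column, then drop them all at once.
nullColumnNames.foreach { name =>
  println(s"THE COLUMN ${name.toUpperCase} WAS DROPPED, IT CONTAINS ONLY NULL VALUES")
}

val dfWithoutNullColumns = df.drop(nullColumnNames: _*)
// In this answer's example, Timestamp is the all-null column, so only brand and Weight remain;
// on the question's DataFrame, Weight would be dropped and brand and Timestamp kept.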