In Apache Spark DataFrame, how to drop all columns where all non None values are identical?


I have a DataFrame in Apache Spark SQL, and I want to drop all columns in which all non-null values are identical.

So, in a dummy example:

df

| A   | B | C   |
| 1   | 2 | 3   |
| NaN | 2 | 4   |
| 1   | 2 | NaN |
| 1   | 2 | 5   |

I want to keep only column C:

df_filter

| C   |
| 3   |
| 4   |
| NaN |
| 5   |

In Python (pandas) I would do it the following way:

# Fill NaNs with the column median so NaN-only variation doesn't count,
# then count the distinct values in each column
nunique = df.fillna(df.median()).nunique()
# Columns with a single distinct value carry no information, so drop them
cols_to_drop = nunique[nunique == 1].index
df = df.drop(cols_to_drop, axis=1)

But how would I do this with an Apache Spark SQL DataFrame (Scala)?

CodePudding user response:

One way to go about it is to use countDistinct on all the columns. The function natively ignores null values:

import org.apache.spark.sql.functions.{col, countDistinct}

// Count the distinct (non-null) values of every column in a single pass
val uniqueCounts = df
    .select(df.columns.map(c => countDistinct(col(c)) as c): _*)
    .head
// Keep only the columns that have more than one distinct value
val nonUniqueCols = df.columns
    .filter(c => uniqueCounts.getAs[Long](c) > 1)
val df_filter = df.select(nonUniqueCols.map(col): _*)
df_filter.show
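
For reference, here is a minimal sketch of how the dummy DataFrame could be built to try this out, assuming the missing cells are actual nulls (encoded via Option) and that spark is an active SparkSession:

import spark.implicits._

// Option[Double] encodes the missing cells as nulls
val df = Seq(
  (Some(1.0), Some(2.0), Some(3.0)),
  (None,      Some(2.0), Some(4.0)),
  (Some(1.0), Some(2.0), None),
  (Some(1.0), Some(2.0), Some(5.0))
).toDF("A", "B", "C")

// With these nulls, countDistinct sees 1 distinct value in A and B and 3 in C,
// so df_filter.show prints:
// +----+
// |   C|
// +----+
// | 3.0|
// | 4.0|
// |null|
// | 5.0|
// +----+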

Note that if you have NaN (not a number) values rather than null values, they will not be ignored by countDistinct. If you want them to be, replace countDistinct(col(c)) with countDistinct(when(!isnan(col(c)), col(c))): when with no otherwise clause yields null where the condition is false, so the NaN values are effectively turned into nulls.
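
Put together, a sketch of that NaN-aware variant, assuming all the columns are numeric (isnan is only defined for numeric columns):

import org.apache.spark.sql.functions.{col, countDistinct, isnan, when}

// when(...) without otherwise(...) yields null where the condition is false,
// so NaN cells become nulls, which countDistinct then ignores
val uniqueCounts = df
    .select(df.columns.map(c => countDistinct(when(!isnan(col(c)), col(c))) as c): _*)
    .head
val df_filter = df.select(df.columns.filter(c => uniqueCounts.getAs[Long](c) > 1).map(col): _*)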
