I have a DataFrame in Apache Spark SQL where I want to drop all columns in which all non-None values are the same.
So in a dummy example
df
| A   | B | C   |
| 1   | 2 | 3   |
| NaN | 2 | 4   |
| 1   | 2 | NaN |
| 1   | 2 | 5   |
I want to keep only column C
df_filter
| C   |
| 3   |
| 4   |
| NaN |
| 5   |
In Python (pandas) I would do it in the following way:
# replace NaN with the column median so a missing value doesn't count as an extra unique value
nunique = df.fillna(df.median()).nunique()
# columns with a single unique value carry no information
cols_to_drop = nunique[nunique == 1].index
df = df.drop(cols_to_drop, axis=1)
But how would I do it with an Apache Spark SQL DataFrame (Scala)?
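For reference, a minimal sketch of how the dummy frame above could be built in Spark (the schema is my assumption: three nullable Double columns, with the missing entries encoded as null):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// None becomes null in the resulting nullable Double columns
val df = Seq(
  (Some(1.0), Some(2.0), Some(3.0)),
  (None,      Some(2.0), Some(4.0)),
  (Some(1.0), Some(2.0), None),
  (Some(1.0), Some(2.0), Some(5.0))
).toDF("A", "B", "C")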
CodePudding user response:
One way to go at it would be to use countDistinct on all columns. The function natively ignores null values:
import org.apache.spark.sql.functions.{col, countDistinct}

// a single aggregated row holding the distinct (non-null) count of every column
val uniqueCounts = df
  .select(df.columns.map(c => countDistinct(col(c)) as c): _*)
  .head
// keep only the columns with more than one distinct non-null value
val nonUniqueCols = df.columns
  .filter(c => uniqueCounts.getAs[Long](c) > 1)
val df_filter = df.select(nonUniqueCols.map(col): _*)
df_filter.show
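Since the aggregation collapses the DataFrame to a single row, .head only brings that one row back to the driver. Note also that a column containing nothing but nulls gets a distinct count of 0, so it is dropped as well.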
Note that if you have NaN (not a number) values rather than null values, they won't be ignored by countDistinct. If you want them to be, replace countDistinct(col(c)) with countDistinct(when(col(c) =!= Double.NaN, col(c))) to turn NaN values into null values (Spark considers NaN equal to NaN, so for a NaN value the condition is false and the when falls through to null).
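Putting that together, a minimal sketch of the NaN-aware variant might look like this; I'm using isnan from org.apache.spark.sql.functions instead of the =!= Double.NaN comparison (both map NaN to null before counting), and I'm assuming all columns are numeric, since isnan only applies to float/double values:

import org.apache.spark.sql.functions.{col, countDistinct, isnan, when}

// count distinct values per column, treating NaN the same as null
val uniqueCounts = df
  .select(df.columns.map(c => countDistinct(when(!isnan(col(c)), col(c))) as c): _*)
  .head
val nonUniqueCols = df.columns
  .filter(c => uniqueCounts.getAs[Long](c) > 1)
val df_filter = df.select(nonUniqueCols.map(col): _*)
df_filter.show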