I have a DataFrame in Apache Spark SQL where I want to drop all columns in which all non-None values are the same.
So in a dummy example
df
| A   | B | C   |
| 1   | 2 | 3   |
| NaN | 2 | 4   |
| 1   | 2 | NaN |
| 1   | 2 | 5   |
I want to keep only column C
df_filter
| C   |
| 3   |
| 4   |
| NaN |
| 5   |
In Python (pandas) I would do it in the following way:
# replace NaN with the column median so a missing value doesn't count as an extra unique value
nunique = df.fillna(df.median()).nunique()
# columns with a single unique value carry no information
cols_to_drop = nunique[nunique == 1].index
df = df.drop(cols_to_drop, axis=1)
But how would I do it with an Apache Spark SQL DataFrame (Scala)?
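For reference, a minimal sketch of how the dummy frame above could be built in Spark (the schema is my assumption: three nullable Double columns, with the missing entries encoded as null):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// None becomes null in the resulting nullable Double columns
val df = Seq(
  (Some(1.0), Some(2.0), Some(3.0)),
  (None,      Some(2.0), Some(4.0)),
  (Some(1.0), Some(2.0), None),
  (Some(1.0), Some(2.0), Some(5.0))
).toDF("A", "B", "C")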
CodePudding user response:
One way to go at it would be to use countDistinct on all columns. The function natively ignores null values:
import org.apache.spark.sql.functions.{col, countDistinct}

// a single aggregated row holding the distinct (non-null) count of every column
val uniqueCounts = df
  .select(df.columns.map(c => countDistinct(col(c)) as c): _*)
  .head
// keep only the columns with more than one distinct non-null value
val nonUniqueCols = df.columns
  .filter(c => uniqueCounts.getAs[Long](c) > 1)
val df_filter = df.select(nonUniqueCols.map(col): _*)
df_filter.show
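Since the aggregation collapses the DataFrame to a single row, .head only brings that one row back to the driver. Note also that a column containing nothing but nulls gets a distinct count of 0, so it is dropped as well.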
Note that if you have NaN (not a number) values rather than null values, they won't be ignored by countDistinct. If you want them to be, replace countDistinct(col(c)) with countDistinct(when(col(c) =!= Double.NaN, col(c))) to turn NaN values into null values (Spark considers NaN equal to NaN, so for a NaN value the condition is false and the when falls through to null).
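Putting that together, a minimal sketch of the NaN-aware variant might look like this; I'm using isnan from org.apache.spark.sql.functions instead of the =!= Double.NaN comparison (both map NaN to null before counting), and I'm assuming all columns are numeric, since isnan only applies to float/double values:

import org.apache.spark.sql.functions.{col, countDistinct, isnan, when}

// count distinct values per column, treating NaN the same as null
val uniqueCounts = df
  .select(df.columns.map(c => countDistinct(when(!isnan(col(c)), col(c))) as c): _*)
  .head
val nonUniqueCols = df.columns
  .filter(c => uniqueCounts.getAs[Long](c) > 1)
val df_filter = df.select(nonUniqueCols.map(col): _*)
df_filter.show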