How to list column/columns in Pyspark Dataframe which has all the value as Null or '0'


I am trying to get the list of columns from the dataframe below in which every value is null or 0.

+-----+----+----+----+----+
| Col1|Col2|Col3|Col4|Col5|
+-----+----+----+----+----+
|Jamer|null|   M|   0|TW18|
|XYZAM|null|   F|   0|   0|
|Nowra|null|null|   0|null|
+-----+----+----+----+----+

Expected output: Col2 & Col4

If I use the PySpark Column class's isNull, the result also includes Col3 & Col5, which is not expected. Any suggestions?

CodePudding user response:

I recommend replacing all 0 (and "0") values with null and then running the summary("count") function to get the count of non-null values in each column. Columns with a count of 0 contain only nulls (or zeros that were replaced with null).

# create an example dataframe
df = spark.createDataFrame(
    [("Jamer", None, "M", 0, "TW18"),
     ("XYZAM", None, "F", 0, "0"),
     ("Nowra", None, None, 0, None)],
    "col1 string, col2 string, col3 string, col4 int, col5 string"
)

# replace int 0 and string "0" with null, then count non-null values per column
df.replace(0, None).replace("0", None) \
    .summary("count") \
    .show()

+-------+----+----+----+----+----+
|summary|col1|col2|col3|col4|col5|
+-------+----+----+----+----+----+
|  count|   3|   0|   2|   0|   1|
+-------+----+----+----+----+----+
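
To turn that output into the expected list of column names, you can collect the single count row and keep the columns whose count is "0" (summary() returns its values as strings); a minimal sketch along these lines:

# collect the single "count" row; summary() returns all counts as strings
row = df.replace(0, None).replace("0", None).summary("count").collect()[0]

# keep the original columns whose non-null count is 0
all_null_cols = [c for c in df.columns if row[c] == "0"]
print(all_null_cols)   # ['col2', 'col4']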

CodePudding user response:

I would count the nulls and compare the result to the total number of rows:

from pyspark.sql.functions import col, count, isnan, lit, when

# when() must return a non-null value (lit(1)); returning the column itself would make count() skip the null rows
counted_nulls = df.select([(count(when(isnan(c) | col(c).isNull(), lit(1))) == df.count()).alias(c) for c in df.columns])

Then, you can get the columns where all values are null like this:

[k for k, v in counted_nulls.collect()[0].asDict().items() if v is True]
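
Note that this flags only all-null (or all-NaN) columns, so on the example data it returns ['col2'] and misses col4, which is all zeros. To catch those as well, one option is to first apply the replace trick from the answer above; a sketch of that combination:

# replace 0 / "0" with null first (same idea as the answer above), then re-run the check
df2 = df.replace(0, None).replace("0", None)
counted_nulls = df2.select([(count(when(col(c).isNull(), lit(1))) == df2.count()).alias(c) for c in df2.columns])
[k for k, v in counted_nulls.collect()[0].asDict().items() if v is True]   # ['col2', 'col4']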