pyspark remove empty values from all columns and replace it with null


I am trying to clean my dataset of missing values. Some rows contain empty cells, for example:

ID  A      B
1          324
2   Breda
3   null   34556

I would like cells A1 and B2 (and any other empty cell) to become null, without doing the cleaning column by column. I want to loop over all columns without specifying the column names.

I have found this code, but the last line returns an error. My table name is custom:

def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c => {
      when(col(c)=="" ,null).otherwise(col(c)).alias(c)
    })
  }
custom.select(replaceEmptyCols(custom.columns):_*).show()

The error is:

SyntaxError: invalid syntax (, line 6)
  File "<command-447346330485202>", line 6
    custom.select(replaceEmptyCols(custom.columns):_*).show()
                                                  ^
SyntaxError: invalid syntax

CodePudding user response:

Maybe you are looking for something like this?

custom = spark.createDataFrame(
    [('1','','324')
    ,('2','Breda','')
    ,('3',None,'34556')
    ],
    ['ID','A','B']
)

custom.show()

# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1|     |  324|
# |  2|Breda|     |
# |  3| null|34556|
# +---+-----+-----+

import pyspark.sql.functions as F


def replaceEmptyCols(df, columns: list):
    # Replace empty strings (and existing nulls) with a true null in each listed column
    for c in columns:
        df = df.withColumn(c, F.when((F.col(c) == '') | F.col(c).isNull(), F.lit(None))
                               .otherwise(F.col(c)))
    return df

replaceEmptyCols(custom, [c for c in custom.columns if c not in ['ID']]).show()
        
# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1| null|  324|
# |  2|Breda| null|
# |  3| null|34556|
# +---+-----+-----+
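
As an aside, the SyntaxError in the question comes from pasting Scala into a Python notebook: the :_* varargs expansion and the => lambda are Scala syntax, not Python. A direct PySpark equivalent of that single-select approach, assuming the same custom DataFrame and the F import above (a sketch, not a tested drop-in), is a list comprehension:

custom.select(
    # rewrite every column: an empty string becomes null, everything else is kept
    [F.when(F.col(c) == '', None).otherwise(F.col(c)).alias(c) for c in custom.columns]
).show()

This gives the same result as replaceEmptyCols above; the ID column is included in the loop but comes out unchanged, since it has no empty values.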