Home > Mobile >  How to fill empty cells and any cell which contains only spaces with Null in Spark DataFrame?
How to fill empty cells and any cell which contains only spaces with Null in Spark DataFrame?

Time:12-01

I have a dataset which has empty cells, and also cells which contain only spaces (one or more). I want to convert all these cells into Null.

Sample dataset:

data = [("", "CA", " "), ("Julia", "", None),("Robert", "  ", None), ("Tom", "NJ", "   ")]
df = spark.createDataFrame(data,["name", "state", "code"])
df.show() 

enter image description here

I can convert empty cells by:

df = df.select( [F.when(F.col(c)=="", None).otherwise(F.col(c)).alias(c) for c in df.columns] )
df.show() 

enter image description here

And cells with one space:

df = df.select( [F.when(F.col(c)==" ", None).otherwise(F.col(c)).alias(c) for c in df.columns] )
df.show() 

enter image description here

But, I don't want to repeat the above codes for cells with 2, 3, or more spaces.

Is there any way I can convert those cells at once?

CodePudding user response:

You can additional use trim or regex_replace the column before you apply when-otherwise

Trim

df = df.select( [F.when(F.trim(F.col(c))=="", None).otherwise(F.col(c)).alias(c) for c in df.columns] )

Regex Replace

df = df.select( [F.when(F.regexp_replace(col(c), "^\s $", ""))=="", None).otherwise(F.col(c)).alias(c) for c in df.columns] )

CodePudding user response:

you could use trim to remove the spaces which leaves a blank and then check for blanks in all cells.

see example below

data_sdf. \
    selectExpr(*['if(trim({0}) = "", null, {0}) as {0}'.format(c) for c in data_sdf.columns]). \
    show()

#  ------ ----- ---- 
# |  name|state|code|
#  ------ ----- ---- 
# |  null|   CA|null|
# | Julia| null|null|
# |Robert| null|null|
# |   Tom|   NJ|null|
#  ------ ----- ---- 

the list comprehension would result in if expression statements for every column

['if(trim({0}) = "", null, {0}) as {0}'.format(c) for c in data_sdf.columns]

# ['if(trim(name) = "", null, name) as name',
#  'if(trim(state) = "", null, state) as state',
#  'if(trim(code) = "", null, code) as code']
  • Related