PySpark replace() function does not replace integer with NULL value


Note: this is for Spark version 2.1.1.2.6.1.0-129 (an HDP build of Spark 2.1.1).
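(For context, you can confirm which version a session is running with spark.version; a minimal check, assuming a Spark 2.x shell where the spark session object is predefined:

print(spark.version)
)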

I have a Spark DataFrame (Python). I would like to replace all instances of 0 across the entire DataFrame, without specifying particular column names, with NULL values.

The following is the code that I have written:

my_df = my_df.na.replace(0, None)

The following is the error that I receive:

  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1634, in replace
    return self.df.replace(to_replace, value, subset)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1323, in replace
    raise ValueError("value should be a float, int, long, string, list, or tuple")
ValueError: value should be a float, int, long, string, list, or tuple 

Answer:

In Spark 2.1.1, df.na.replace does not accept None as the replacement value. Support for None was only added in Spark 2.3.0, so it is not available in your case.
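For reference, on Spark 2.3.0 or later your original one-liner works as written, since None became a valid replacement value there (a minimal sketch, not applicable to 2.1.1):

# Spark >= 2.3.0 only: None is accepted as the replacement value
my_df = my_df.na.replace(0, None)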

To replace values dynamically (i.e. without typing column names manually) on your version, you can use either df.columns or df.dtypes. The latter also lets you check the data type of each column, as in the example below; a df.columns variant is sketched after it.

from pyspark.sql import functions as F

# df.dtypes yields (column_name, type_name) pairs, so the data type
# can be checked as well; here only bigint columns are rewritten,
# turning 0 into NULL
for c in df.dtypes:
    if c[1] == 'bigint':
        df = df.withColumn(c[0], F.when(F.col(c[0]) == 0, F.lit(None)).otherwise(F.col(c[0])))

# Input
# +---+---+
# | id|val|
# +---+---+
# |  0|  a|
# |  1|  b|
# |  2|  c|
# +---+---+

# Output
# +----+---+
# |  id|val|
# +----+---+
# |null|  a|
# |   1|  b|
# |   2|  c|
# +----+---+
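The df.columns variant can be written as a single select that rebuilds every column. A minimal sketch, assuming the columns being compared to 0 are numeric; on a string column the comparison goes through an implicit cast (so a string like "0" would also become NULL), which is why restricting by dtype, as above, is safer:

from pyspark.sql import functions as F

# Sketch: rebuild all columns in one pass, turning 0 into NULL
df = df.select([
    F.when(F.col(c) == 0, F.lit(None)).otherwise(F.col(c)).alias(c)
    for c in df.columns
])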