Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a Spark DataFrame (PySpark). I would like to replace all instances of 0 across the entire DataFrame (without specifying particular column names) with NULL values.
The following is the code that I have written:
my_df = my_df.na.replace(0, None)
The following is the error that I receive:
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1634, in replace
return self.df.replace(to_replace, value, subset)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1323, in replace
raise ValueError("value should be a float, int, long, string, list, or tuple")
ValueError: value should be a float, int, long, string, list, or tuple
CodePudding user response:
Apparently in Spark 2.1.1, df.na.replace does not support None. The None option is only available since 2.3.0, which is not applicable in your case.
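For reference, on Spark 2.3.0 or later your original call should work as written (a minimal sketch, assuming an upgraded cluster):

# Spark >= 2.3.0 only: None is accepted as the replacement value
df = df.na.replace(0, None)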
To replace values dynamically (i.e. without typing column names manually), you can use either df.columns or df.dtypes. The latter also lets you check each column's data type.
from pyspark.sql import functions as F

# Replace 0 with NULL in every column whose type is bigint
for c in df.dtypes:
    if c[1] == 'bigint':
        df = df.withColumn(c[0], F.when(F.col(c[0]) == 0, F.lit(None)).otherwise(F.col(c[0])))
# Input
# +---+---+
# | id|val|
# +---+---+
# |  0|  a|
# |  1|  b|
# |  2|  c|
# +---+---+

# Output
# +----+---+
# |  id|val|
# +----+---+
# |null|  a|
# |   1|  b|
# |   2|  c|
# +----+---+
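If you would rather build a single projection than chain withColumn calls in a loop, here is a sketch of the same idea (assuming, as above, that the columns to clean are of type bigint):

from pyspark.sql import functions as F

# One select that nulls out 0 in every bigint column and passes the rest through unchanged
df = df.select([
    F.when(F.col(name) == 0, F.lit(None)).otherwise(F.col(name)).alias(name)
    if dtype == 'bigint' else F.col(name)
    for name, dtype in df.dtypes
])

This applies all the replacements in one pass and keeps the query plan flatter than repeated withColumn calls.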