I have a DataFrame called good_df with columns of mixed types. I'm trying to set any empty values to 'null' in the StringType columns. I expected the code below to work, but it doesn't.
self.good_df = self.good_df.select([when((col(c)=='') & (isinstance(self.good_df.schema[c].dataType, StringType)),'null').otherwise(col(c)).alias(c) for c in self.good_df.columns])
I'm looking at the error message and it's not giving me much in the way of clues:
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.7/site-packages/pyspark/sql/column.py", line 116, in _
    njc = getattr(self._jc, name)(jc)
  File "/usr/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in call
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o792.and. Trace:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Does anyone have any ideas on what is going on? Thank you!
CodePudding user response:
The error message you got:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
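This means you're trying to apply the AND operator (&) between a Column expression and a plain Python Boolean. isinstance(...) is evaluated in Python and returns True or False, and the Column class has no and method that accepts a raw Boolean.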
You need to change this part:
(isinstance(self.good_df.schema[c].dataType, StringType))
to
from pyspark.sql.functions import lit
lit(isinstance(self.good_df.schema[c].dataType, StringType))
That said, you can move the column-type check into the Python list comprehension itself, so the when expression is only built for string columns:
self.good_df = self.good_df.select(*[
    when((col(c) == ''), 'null').otherwise(col(c)).alias(c) if t == "string" else col(c)
    for c, t in self.good_df.dtypes
])