Pyspark 2.7 Set StringType columns in a dataframe to 'null' when value is ""


I have a DataFrame called good_df with columns of mixed types. I'm trying to set any empty string values to 'null' in columns of StringType. I would have thought the code below would work, but it doesn't.

self.good_df = self.good_df.select([
    when(
        (col(c) == '') & (isinstance(self.good_df.schema[c].dataType, StringType)),
        'null'
    ).otherwise(col(c)).alias(c)
    for c in self.good_df.columns
])

I'm looking at the error message and it's not giving me much in the way of clues:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pyspark/sql/column.py", line 116, in _
    njc = getattr(self._jc, name)(jc)
  File "/usr/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o792.and. Trace:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Does anyone have any ideas on what is going on? Thank you!

CodePudding user response:

The error message you got:

py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist

This means you're trying to apply the AND operator (&) between a Column expression and a plain Python Boolean. The isinstance(self.good_df.schema[c].dataType, StringType) check is evaluated on the driver to a Python bool, so Py4J ends up looking for a JVM method Column.and(java.lang.Boolean), which doesn't exist: Column.and only accepts another Column.
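A minimal sketch of the failure mode, assuming an active Spark session (the column name "a" is arbitrary):

from pyspark.sql.functions import col, lit

col("a") & True        # raises Py4JError: Method and([class java.lang.Boolean]) does not exist
col("a") & lit(True)   # fine: lit() wraps the Python bool in a constant Column expression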

You need to change this part:

(isinstance(self.good_df.schema[c].dataType, StringType))

to

from pyspark.sql.functions import lit

lit(isinstance(self.good_df.schema[c].dataType, StringType))
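With that fix applied, the original statement becomes (a sketch keeping your structure; the driver-side isinstance check is now folded into the expression as a constant Column):

from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StringType

self.good_df = self.good_df.select([
    when(
        # lit() turns the Python bool into a constant Column, so & is Column-to-Column
        (col(c) == '') & lit(isinstance(self.good_df.schema[c].dataType, StringType)),
        'null'
    ).otherwise(col(c)).alias(c)
    for c in self.good_df.columns
])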

That said, since the column types are already known on the driver, a cleaner option is to move the type check into the Python list comprehension itself, wrapping only the string columns in when(...):

from pyspark.sql.functions import col, when

self.good_df = self.good_df.select(*[
    # dtypes yields (column_name, type_string) pairs, e.g. ('name', 'string')
    when((col(c) == ''), 'null').otherwise(col(c)).alias(c) if t == "string" else col(c)
    for c, t in self.good_df.dtypes
])
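
A quick way to sanity-check this on a toy DataFrame (the column names and values here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("alice", "", 1), ("", "ny", 2)],
    ["name", "city", "id"],  # 'id' is a bigint, so it is passed through untouched
)

fixed = df.select(*[
    when((col(c) == ''), 'null').otherwise(col(c)).alias(c) if t == "string" else col(c)
    for c, t in df.dtypes
])
fixed.show()
# +-----+----+--+
# | name|city|id|
# +-----+----+--+
# |alice|null| 1|
# | null|  ny| 2|
# +-----+----+--+

Note that, as in your original code, this writes the literal string 'null'; if you actually want a SQL NULL, replace 'null' with lit(None) from pyspark.sql.functions.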