The following Python code loads a CSV file into a DataFrame df and passes a string value from one or more columns of df to the UDF testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error.
Question: What might I be doing wrong, and how can I fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street). So why the error when the concatenated values of two columns are sent to the UDF?
Error:
PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")
def testFunction(value):
    mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
    return mystr
newFunction = F.udf(testFunction, StringType())
df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()
CodePudding user response:
In PySpark you cannot concatenate StringType columns with the + operator. The expression evaluates to null, which breaks your UDF: the function receives None, and None has no .upper() method, hence the AttributeError. Use concat instead.
df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))