The following Python code loads a CSV file into a DataFrame df and passes a string value from one or more columns of df to the UDF testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error.
Question: What might I be doing wrong, and how can I fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street). So why the error when the concatenated values of two columns are sent to the UDF?
Error:
PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")
def testFunction(value):
    mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
    return mystr
newFunction = F.udf(testFunction, StringType())
df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()
CodePudding user response:
In PySpark you cannot concatenate StringType columns with the + operator. The expression evaluates to null, which breaks your UDF: the function receives None, and None has no .upper() method, hence the AttributeError. Use concat instead.
df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))