'TypeError: decoding str is not supported' when concatenating in udf pyspark

Time:06-06

I'm trying to create a simple UDF that concatenates 2 strings and a separator.

def stringConcat(separator: str, first: str, second: str):
    return first + separator + second
spark.udf.register("stringConcat_udf", stringConcat)
customerDf.select("firstname", "lastname", stringConcat_udf(lit("-"), "firstname", "lastname")).show()

This is the traceback:

An exception was thrown from a UDF: 'TypeError: decoding str is not supported'. Full traceback below:
TypeError: decoding str is not supported

What is wrong with this?

CodePudding user response:

For one thing, PySpark already has a built-in function, concat_ws, that does exactly this:

from pyspark.sql import functions as fn
customerDf.select("firstname", "lastname", fn.concat_ws("-","firstname", "lastname").alias("joined")).show()

But if you still want to define this UDF: the return value of spark.udf.register("stringConcat_udf", stringConcat) isn't assigned to anything. The call registers the UDF by name, so it is usable in Spark SQL queries, but it does not define a Python variable named stringConcat_udf. To use the UDF with PySpark DataFrames, create it with udf:

from pyspark.sql import functions as fn
from pyspark.sql.types import StringType
stringConcat_udf = fn.udf(stringConcat, StringType())
customerDf.select("firstname", "lastname", stringConcat_udf(fn.lit("-"),"firstname", "lastname").alias("joined")).show()
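
As for the message itself: "decoding str is not supported" is what Python's built-in str raises when it is called with a str plus extra positional arguments, because str(object, encoding, errors) only decodes bytes-like objects. That suggests the name stringConcat_udf was bound to something other than the function above. This is an assumption about the asker's session, not something the traceback confirms. A minimal plain-Python sketch (no Spark needed; string_concat and call_and_capture are illustrative names, not from the question):

```python
# Plain-Python sketch: reproduce the error text without Spark.
# Assumption: some callable ended up invoking str() with three string arguments.


def string_concat(separator: str, first: str, second: str) -> str:
    """The concatenation the question intended."""
    return first + separator + second


def call_and_capture(func, *args):
    """Call func(*args); return its result, or the TypeError text if it raises."""
    try:
        return func(*args)
    except TypeError as exc:
        return f"TypeError: {exc}"


# str(obj, encoding, errors) only decodes bytes-like objects,
# so passing a str as the first argument fails.
builtin_result = call_and_capture(str, "-", "Tom", "Hanks")
intended_result = call_and_capture(string_concat, "-", "Tom", "Hanks")

print(builtin_result)
print(intended_result)
```

The first line printed is the TypeError text from the question; the second is Tom-Hanks.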

CodePudding user response:

After registering your UDF, you can call it using expr. Try this:

from pyspark.sql.functions import expr
customerDf.select("firstname", "lastname", expr('stringConcat_udf("-", firstname, lastname)'))

This works:

from pyspark.sql import functions as F
customerDf = spark.createDataFrame([('Tom', 'Hanks')], ["firstname", "lastname"])

def stringConcat(separator: str, first: str, second: str):
    return first + separator + second
spark.udf.register("stringConcat_udf", stringConcat)
df = customerDf.select("firstname", "lastname", F.expr('stringConcat_udf("-", firstname, lastname)'))
df.show()
# +---------+--------+----------------------------------------+
# |firstname|lastname|stringConcat_udf(-, firstname, lastname)|
# +---------+--------+----------------------------------------+
# |      Tom|   Hanks|                               Tom-Hanks|
# +---------+--------+----------------------------------------+