what is the difference between udf from SparkSession and udf from pyspark.sql.functions


I have two ways to use udf in pyspark:

1.

import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark.udf)
output:
<pyspark.sql.udf.UDFRegistration at 0x7f5532f823a0>

2.

from pyspark.sql.functions import udf
print(udf)
output:
<function pyspark.sql.functions.udf(f=None, returnType=StringType)>

I do not understand the intended difference between the two, and why two APIs are available. spark.udf has a register method, and I think registering a UDF is necessary, so why is register not available in pyspark.sql.functions? Why is it needed only in the first case?

Can you please help me clarify these doubts?

CodePudding user response:

spark.udf.register registers a UDF so it can be invoked by name in Spark SQL queries, while pyspark.sql.functions.udf creates a UDF to be called directly through the DataFrame API.

Register a UDF and use it with SQL
from pyspark.sql.types import LongType

df = spark.range(1, 5)
df.createOrReplaceTempView("tb")

def plus_one(v):
    return v + 1

spark.udf.register("plus_one_udf", plus_one, LongType())

spark.sql("select id, plus_one_udf(id) as id2 from tb").show()
#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+
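
A side note on the returnType argument, which also explains the returnType=StringType default visible in the signature printed in the question: if you omit it when registering a plain Python function, the result column is typed as a string. A minimal sketch reusing plus_one and the tb view from above; the name plus_one_str_udf is introduced here purely for illustration:

# Without an explicit returnType, the registered UDF defaults to StringType,
# so the result column is typed as string in the query schema.
spark.udf.register("plus_one_str_udf", plus_one)
spark.sql("select plus_one_str_udf(id) as id2 from tb").printSchema()
# root
#  |-- id2: string (nullable = true)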
Create a UDF and use it with the DataFrame API
import pyspark.sql.functions as F

plus_one_udf = F.udf(plus_one, LongType())

df.withColumn("id2", plus_one_udf(F.col("id"))).show()

#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+
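
The two entry points are also not mutually exclusive. Since Spark 2.3, spark.udf.register returns the registered function as a callable that works in DataFrame expressions, and a UDF created with pyspark.sql.functions.udf can itself be passed to spark.udf.register to make it callable from SQL. A minimal sketch building on the code above; the names plus_one_sql, plus_one_df and plus_one_udf2 are introduced here only for illustration:

# register returns a callable, so the same UDF can be used with the DataFrame API
plus_one_sql = spark.udf.register("plus_one_udf", plus_one, LongType())
df.withColumn("id2", plus_one_sql(F.col("id"))).show()

# conversely, a UDF built with F.udf can be registered for use in SQL queries
plus_one_df = F.udf(plus_one, LongType())
spark.udf.register("plus_one_udf2", plus_one_df)
spark.sql("select id, plus_one_udf2(id) as id2 from tb").show()

Both calls produce the same output as the examples above.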