I see two ways to use a UDF in PySpark:
1.
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark.udf)
output:
<pyspark.sql.udf.UDFRegistration at 0x7f5532f823a0>
2.
from pyspark.sql.functions import udf
print(udf)
output:
<function pyspark.sql.functions.udf(f=None, returnType=StringType)>
I do not understand the intended difference between the two, and why two separate APIs are available. spark.udf has a register method, and I think registering a UDF is necessary. So why is register not available in pyspark.sql.functions? Why is it needed only in the first case?
Can you please help me clarify these doubts?
CodePudding user response:
spark.udf.register
registers a UDF so it can be invoked by name in Spark SQL queries, while pyspark.sql.functions.udf
creates a UDF to be called through the DataFrame API.
Register UDF and use with SQL
from pyspark.sql.types import LongType
df = spark.range(1, 5)
df.createOrReplaceTempView("tb")
def plus_one(v):
    return v + 1
spark.udf.register("plus_one_udf", plus_one, LongType())
spark.sql("select id, plus_one_udf(id) as id2 from tb").show()
#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+
Using with DataFrame API
import pyspark.sql.functions as F
plus_one_udf = F.udf(plus_one, LongType())
df.withColumn("id2", plus_one_udf(F.col("id"))).show()
#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+