I am a beginner to Scala and am using the Scala REPL window in IntelliJ. I have a sample DataFrame and am trying to test a user-defined function (not a built-in one) to understand how UDFs work.

The DataFrame:
scala> import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.appName("elephant").config("spark.master", "local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(("A",1),("B",2),("C",3))).toDF("Letter", "Number")
df.show()
output:

+------+------+
|Letter|Number|
+------+------+
|     A|     1|
|     B|     2|
|     C|     3|
+------+------+
The function I want to use to filter the DataFrame:

scala> def kill_4(n: String): Boolean = {
     |   if (n == "A") { true } else { false }
     | } // please validate if this is correct?
I tried:

df.withColumn("new_col", kill_4(col("Letter"))).show() // please tell me the correct way?

but it fails with:

error: type mismatch
Second, I tried it on a list of column values:

scala> val lst = df.select("Letter").collect().map(_(0)).toList
val res56: List[Any] = List(A, B, C)

lst.filter(kill_4) // please suggest?

This also fails with a type mismatch error.
CodePudding user response:
You can wrap your function in a UDF and use it as follows:

import org.apache.spark.sql.functions.{col, udf}

def kill_4(n: String): Boolean = {
  if (n == "A") { true } else { false }
}

// Wrap the plain Scala function so it can be applied to Columns.
val kill_udf = udf((x: String) => kill_4(x))

df.select(col("Letter"), col("Number"),
  kill_udf(col("Letter")).as("Kill_4")).show(false)
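If the goal is to filter rows rather than add a column, the same UDF also works inside filter (a sketch reusing the kill_udf defined above):

// Keep only the rows where the UDF returns true.
df.filter(kill_udf(col("Letter"))).show(false)

As for the List attempt: mapping the collected Rows with _(0) yields List[Any], which does not match the String parameter of kill_4. Extracting the value as a String first makes the filter compile (a sketch):

val lst = df.select("Letter").collect().map(_.getString(0)).toList
lst.filter(kill_4) // List(A)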
CodePudding user response:
Please look at the Databricks documentation on Scala user-defined functions.
In a Databricks notebook (or the Spark shell), a SparkSession named spark already exists, so you do not need the session-creation code to create a DataFrame; I removed it.
Your function had a couple of bugs. Since it is very small, I created an inline one. The udf() call allows the function to be used with DataFrames, and the call to spark.udf.register allows it to be used from Spark SQL.
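A minimal sketch of what that might look like (the names kill4 and kill4_udf are illustrative):

import org.apache.spark.sql.functions.udf

// Inline version of the function: true only when the letter is "A".
val kill4 = (n: String) => n == "A"

// udf() wraps the function for use in DataFrame expressions.
val kill4_udf = udf(kill4)

// register() exposes it to Spark SQL under the name "kill_4".
spark.udf.register("kill_4", kill4)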
A quick SQL statement shows the function works.
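For example, after registering the DataFrame as a temporary view (the view name letters is an assumption):

// Expose the DataFrame to SQL, then call the registered UDF by name.
df.createOrReplaceTempView("letters")
spark.sql("SELECT Letter, Number, kill_4(Letter) AS kill_4 FROM letters").show(false)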
Last but not least, the udf() and col() functions must be imported for the last statement to work.
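With those imports in place, the DataFrame version looks like this (a sketch using the kill4_udf defined above):

import org.apache.spark.sql.functions.{col, udf}

// Apply the wrapped UDF to the Letter column.
df.withColumn("kill_4", kill4_udf(col("Letter"))).show(false)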
In short, these three snippets solve your problem.