Adding the result of a function in a Dataframe column [Spark Scala]

I want to do some calculations and add the result to an existing DataFrame. I have the following function to calculate the H3 address based on the longitude and latitude.

def getH3Address(x: Double, y: Double): String = {
  h3.get.geoToH3Address(x, y)
}
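
For context: h3 here is presumably a handle to Uber's H3 Java bindings. A minimal sketch of such a setup (assuming the 3.x API, in which geoToH3Address also takes a resolution argument, as the edit below uses) could look like:

import com.uber.h3core.H3Core

// Assumption: `h3` wraps a lazily created H3Core instance from Uber's
// H3 Java bindings; geoToH3Address(lat, lng, res) returns the cell
// address as a String.
lazy val h3: Option[H3Core] = Some(H3Core.newInstance())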

I created a DataFrame with the following schema:

root
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- elevation: integer (nullable = true)

I want to add a new column called H3Address to this DataFrame, where the H3 address is calculated from the lat and lon of each row.

Here is a short excerpt of the DataFrame I want to achieve:

+----+------------------+---------+---------+
| lat|               lon|elevation|H3Address|
+----+------------------+---------+---------+
|51.0|               3.0|       13|   a3af83|
|51.0| 3.000277777777778|       13|   a3zf83|
|51.0|3.0005555555555556|       12|   a1qf82|
|51.0|3.0008333333333335|       12|   l3xf83|
+----+------------------+---------+---------+

I tried something like:

df.withColumn("H3Address", geoToH3Address(df.select(df("lat")), df.select(df("lon")))

But this didn't work.
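
Presumably this fails because withColumn expects a Column expression, while df.select(df("lat")) returns a one-column DataFrame, and a plain Scala function cannot be applied to Columns without being wrapped in a UDF (see the answer below). For illustration:

val asColumn: org.apache.spark.sql.Column = df("lat")  // a Column expression
val asDataFrame = df.select(df("lat"))                 // a DataFrame, not a Column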

Can someone help me out?

Edit:

After applying the suggestion of @Garib, I ended up with the following lines:

val getH3Address = udf(
  (lat: Double, lon: Double, res: Int) => {
    h3.get.geoToH3Address(lat, lon, res).toString
  })

var res: Int = 10

val DF_edit = df.withColumn("H3Address",
  getH3Address(col("lat"), col("lon"), 10))

This time, I get the error:

[error]  type mismatch;
  found   : Int
  required: org.apache.spark.sql.Column

How can I solve this error? I have tried a lot of things, for example using the lit() function.

Edit2:

After using lit() correctly, the proposed solution worked.

Solution: df.withColumn("H3Address", getH3Address(col("lat"), col("lon"), lit(10)))
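
Putting it all together, a complete sketch of the working version (assuming the h3 setup above and a DataFrame df with lat and lon columns) is:

import org.apache.spark.sql.functions.{col, lit, udf}

val getH3Address = udf(
  (lat: Double, lon: Double, res: Int) => {
    h3.get.geoToH3Address(lat, lon, res).toString
  })

// lit(10) wraps the constant resolution in a Column, which is what
// every argument at a UDF call site must be.
val DF_edit = df.withColumn("H3Address",
  getH3Address(col("lat"), col("lon"), lit(10)))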

CodePudding user response:

You should create a UDF out of your function.

User-Defined Functions (UDFs) are user-programmable routines that act on one row.

For example:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // for toDF; assumes an active SparkSession named `spark`

val getH3Address = udf(
  // Write the logic of your function here. I used dummy logic (x + y) just for this example.
  (x: Double, y: Double) => {
    (x + y).toString
  })

val df = Seq((1, 2, "aa"), (2, 3, "bb"), (3, 4, "cc")).toDF("lat", "lon", "value")
df.withColumn("H3Address", getH3Address(col("lat"), col("lon"))).show()

You can read more about UDFs here: https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html
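
That page also covers registering a UDF by name so it can be called from Spark SQL; a minimal sketch (the name addCoords and the view name points are just for illustration) would be:

// Register the same dummy function under a name callable from SQL.
spark.udf.register("addCoords", (x: Double, y: Double) => (x + y).toString)

df.createOrReplaceTempView("points")
spark.sql("SELECT lat, lon, addCoords(lat, lon) AS H3Address FROM points").show()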
