I want to do some calculations and add the result to an existing DataFrame. I have the following function to calculate the address space based on the longitude and latitude:
def getH3Address(x: Double, y: Double): String = {
  h3.get.geoToH3Address(x, y)
}
I created a DataFrame with the following schema:
root
|-- lat: double (nullable = true)
|-- lon: double (nullable = true)
|-- elevation: integer (nullable = true)
I want to append a new column called H3Address to this DataFrame, where the address space is calculated from the lat and lon of that row.
Here is a short part of the DataFrame I want to achieve:
+----+------------------+---------+---------+
| lat|               lon|elevation|H3Address|
+----+------------------+---------+---------+
|51.0|               3.0|       13|   a3af83|
|51.0| 3.000277777777778|       13|   a3zf83|
|51.0|3.0005555555555556|       12|   a1qf82|
|51.0|3.0008333333333335|       12|   l3xf83|
+----+------------------+---------+---------+
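For reference, a small DataFrame with this shape can be built locally from a Seq (a sketch using the rows shown above; `spark` is assumed to be an active SparkSession):

```scala
import spark.implicits._

// Sample rows matching the schema above; H3Address is the column we want to compute.
val df = Seq(
  (51.0, 3.0, 13),
  (51.0, 3.000277777777778, 13),
  (51.0, 3.0005555555555556, 12),
  (51.0, 3.0008333333333335, 12)
).toDF("lat", "lon", "elevation")
```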
I tried something like:
df.withColumn("H3Address", getH3Address(df.select(df("lat")), df.select(df("lon"))))
But this didn't work.
Can someone help me out?
Edit:
After adding the suggestion of @Garib, I got the following lines:
val getH3Address = udf(
  (lat: Double, lon: Double, res: Int) => {
    h3.get.geoToH3Address(lat, lon, res).toString
  })
var res : Int = 10
val DF_edit = df.withColumn("H3Address",
getH3Address(col("lat"), col("lon"), 10))
This time, I get the error:
[error] type mismatch;
found : Int
required: org.apache.spark.sql.Column
How can I solve this error? I tried a lot of things, for example using the lit() function.
Edit2:
After using lit() correctly, the proposed solution worked.
Solution:
df.withColumn("H3Address", getH3Address(col("lat"), col("lon"), lit(10)))
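Put together, a minimal sketch of the working version might look like the following. It assumes `h3` is the same client wrapper as in the question (the `h3.get` call suggests a ThreadLocal around Uber's H3 library), so that part is not verified here:

```scala
import org.apache.spark.sql.functions.{col, lit, udf}

// Wrap the H3 lookup in a UDF so it can operate on Columns row by row.
val getH3Address = udf(
  (lat: Double, lon: Double, res: Int) =>
    h3.get.geoToH3Address(lat, lon, res).toString)

// Constant arguments must be wrapped in lit(): the UDF expects Columns,
// not plain Scala values — this is what caused the "required: Column" error.
val dfWithH3 = df.withColumn(
  "H3Address",
  getH3Address(col("lat"), col("lon"), lit(10)))
```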
CodePudding user response:
You should create a UDF out of your function. User-Defined Functions (UDFs) are user-programmable routines that act on one row.
For example:
val getH3Address = udf(
  // write here the logic of your function. I used a dummy logic (x + y) just for this example.
  (x: Double, y: Double) => {
    (x + y).toString
  })
val df = Seq((1, 2, "aa"), (2, 3, "bb"), (3, 4, "cc")).toDF("lat", "lon", "value")
df.withColumn("H3Address", getH3Address(col("lat"), col("lon"))).show()
You can read more about UDFs here: https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html
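If you also want to call the function from Spark SQL, you can register it by name with the standard `spark.udf.register` API (a sketch; `h3` is again the client from the question):

```scala
// Register the same logic under a name usable in SQL expressions.
spark.udf.register("getH3Address", (lat: Double, lon: Double, res: Int) =>
  h3.get.geoToH3Address(lat, lon, res).toString)

df.createOrReplaceTempView("points")
spark.sql(
  "SELECT lat, lon, elevation, getH3Address(lat, lon, 10) AS H3Address FROM points"
).show()
```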