I have a Spark DataFrame created in Scala that contains geographical coordinates. I need to add a column with the country based on these coordinates. I found some Python tools, but as far as I know I can't use them in Scala code. I'm also not sure about efficiency if I have to process every row one by one with a UDF (it's about 50,000 rows). Do you know how I can process this in the fastest possible way?
CodePudding user response:
You can use a Pandas UDF if the tool you found is a Python library. With this, Spark applies your function to whole batches of rows (Arrow-based vectorization) instead of row by row.
https://databricks.com/fr/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
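As a rough illustration, here is a minimal sketch of a Series-to-Series Pandas UDF in PySpark. The column names `lat`/`lon` and the use of the `reverse_geocoder` package (whose `search` call takes `(lat, lon)` tuples and returns records with a `cc` country code) are assumptions; substitute whatever Python geocoding tool you actually found.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def coords_to_country(lat: pd.Series, lon: pd.Series) -> pd.Series:
    # Assumed library: reverse_geocoder does offline reverse geocoding
    # on a whole batch of coordinates at once.
    import reverse_geocoder as rg
    results = rg.search(list(zip(lat, lon)))
    # 'cc' is the ISO country code in reverse_geocoder's result records.
    return pd.Series([r["cc"] for r in results])

# Hypothetical usage, assuming df has 'lat' and 'lon' columns:
df = df.withColumn("country", coords_to_country("lat", "lon"))
```

Because the UDF receives pandas Series covering many rows per call, the per-row Python overhead is amortized, which is usually what makes the difference at ~50,000 rows compared to a plain row-by-row UDF.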