How to Decode GEOHASH Column using PySpark

Time:11-04

I'm trying to decode a GEOHASH column into latitude and longitude using the pygeohash library. Below is my code:

import pygeohash as pgh
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

udf1 = udf(lambda x: pgh.decode(x))
add_latlong = add.withColumn('location', udf1(col('GEOHASH')))

However, I'm getting the result below:

+------------+--------------------+
|     GEOHASH|            location|
+------------+--------------------+
|w284nyv39qzn|[Ljava.lang.Objec...|
|w0zqyr64nt4v|[Ljava.lang.Objec...|
|w2815pb0yfgr|[Ljava.lang.Objec...|
|w281xv1czv1t|[Ljava.lang.Objec...|
|w2r7cvc0m1bz|[Ljava.lang.Objec...|
+------------+--------------------+

I came across the thread "PySpark UDF Returns [Ljava.lang.Object;@", which suggests passing StringType as the second parameter of udf, but I still see the same result as above. How do I get the latitude and longitude from here?

Appreciate your help

Update: I used the solution from Jonathan Lam below. For completeness, here is the code and the resulting DataFrame.

from pyspark.sql.types import ArrayType, FloatType

udf1 = udf(lambda x: pgh.decode(x), ArrayType(FloatType()))
add_latlong = add.withColumn('location', udf1(col('GEOHASH'))).withColumn('lat', col('location')[0]).withColumn('long', col('location')[1])

+------------+--------------------+--------+----------+
|     GEOHASH|            location|     lat|      long|
+------------+--------------------+--------+----------+
|w2864utg8uyf|[3.189408, 101.73...|3.189408| 101.73035|
|w281hj25hzre|[3.017675, 101.42...|3.017675|101.425995|
|w2830hj8vzrp|[3.010423, 101.60...|3.010423|101.609375|
|w0zf5uepz8uk|[4.596367, 101.06...|4.596367| 101.06768|
|w2rkk6s97gvt|[2.167289, 111.63...|2.167289| 111.63843|
+------------+--------------------+--------+----------+

CodePudding user response:

I'm not sure your case is the same as the one in the link you provided, since you are using an external package, pgh.decode(x), to do the transformation. Based on the docs:

pgh.decode(geohash='ezs42')
# >>> ('42.6', '-5.6')

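pgh.decode returns a (latitude, longitude) pair rather than a single string, so I think you should declare the UDF's return type as ArrayType(FloatType()) instead of StringType (which is also what udf uses when no return type is given).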
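For intuition about why the decoded value is a pair of numbers, here is a dependency-free sketch of geohash decoding. Note that decode_geohash is a hypothetical helper written for illustration, not pygeohash's implementation; it returns the centre of the geohash cell without rounding, so its digits may differ slightly from what pgh.decode prints.

```python
# Standard geohash base-32 alphabet (the letters a, i, l and o are not used).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return (latitude, longitude) of the centre of the geohash cell.

    Hypothetical helper for illustration only; not part of pygeohash.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate between axes, starting with longitude

    for ch in geohash:
        value = _BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid  # keep the upper half of the interval
                else:
                    lon_hi = mid  # keep the lower half of the interval
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon

    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


lat, lon = decode_geohash("ezs42")
print(lat, lon)
```

Because the function returns a tuple of two floats, a UDF wrapping it would need an array (or struct) return type, for the same reason as with pgh.decode.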
