I have URL
data in a column in my dataframe that I need to parse out parameters from the query string and create new columns for.
Sometimes the parameters will exist, sometimes they won't, and they aren't in a specific guaranteed order so I need to be able to find them by name. I am writing this in Qcala but can't get the syntax correct and would love some help.
My code:
val df = Seq(
(1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302&origincountrycode=us&originstateprovincecode=ca&origincity=boston&originradiusmiles=250&datestart=2021-12-23t00:00:00"),
(2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&origincountrycode=us&originstateprovincecode=pa&origincity=york&originradiusmiles=100&destinationlatitude=40.7811012268066&destinationlon")
).toDF("key", "URL")
val result = df
// .withColumn("param_name", $"URL")
.withColumn("parsed_url", explode(split(expr("parse_url(URL, 'QUERY')"), "&")))
.withColumn("parsed_url2", split($"parsed_url", "="))
// .withColumn("exampletest",$"URL".map(kv: String => (kv.split("=")(0), kv.split("=")(1))) )
.withColumn("Search_OriginLongitude", split($"URL","\\?"))
.withColumn("Search_OriginLongitude2", split($"Search_OriginLongitude"(1),"&"))
// .map(kv: Any => (kv.split("=")(0), kv.split("=")(1)))
// .toMap
// .get("originlongitude"))
display(result)
Desired Result:
--- -------------------- -------------------- --------------------
|KEY| URL| originlatitude | originlongitude |
--- -------------------- -------------------- --------------------
| 1|https://www.myweb...| 35.0133612060147 | -116.156211232302 |
| 2|https://www.myweb...| 19.9141319141121 | -56.1241881401291 |
--- -------------------- -------------------- --------------------
CodePudding user response:
parse_url
function can actually take a third parameter key
for the query parameter name you want to extract, like this:
val result = df
.withColumn("Search_OriginLongitude", expr("parse_url(URL, 'QUERY', 'originlatitude')"))
.withColumn("Search_OriginLongitude2", expr("parse_url(URL, 'QUERY', 'originlongitude')"))
result.show
// --- -------------------- ---------------------- -----------------------
//|key| URL|Search_OriginLongitude|Search_OriginLongitude2|
// --- -------------------- ---------------------- -----------------------
//| 1|https://www.myweb...| 35.0133612060147| -116.156211232302|
//| 2|https://www.myweb...| 19.9141319141121| -56.1241881401291|
// --- -------------------- ---------------------- -----------------------
Or you can use str_to_map
function to create a map of parameter->value
like this:
val result = df
.withColumn("URL", expr("str_to_map(split(URL,'[?]')[1],'&','=')"))
.withColumn("Search_OriginLongitude", col("URL").getItem("originlatitude"))
.withColumn("Search_OriginLongitude2", col("URL").getItem("originlongitude"))