Home > Net >  Scala Dataframe column split URL parameters to new columns
Scala Dataframe column split URL parameters to new columns

Time:12-29

I have URL data in a column in my dataframe that I need to parse out parameters from the query string and create new columns for.

Sometimes the parameters will exist, sometimes they won't, and they aren't in a specific guaranteed order so I need to be able to find them by name. I am writing this in Qcala but can't get the syntax correct and would love some help.

My code:


val df = Seq(
  (1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302&origincountrycode=us&originstateprovincecode=ca&origincity=boston&originradiusmiles=250&datestart=2021-12-23t00:00:00"),
  (2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&origincountrycode=us&originstateprovincecode=pa&origincity=york&originradiusmiles=100&destinationlatitude=40.7811012268066&destinationlon")
).toDF("key", "URL")

val result = df

// .withColumn("param_name", $"URL")
.withColumn("parsed_url", explode(split(expr("parse_url(URL, 'QUERY')"), "&")))
.withColumn("parsed_url2", split($"parsed_url", "="))
// .withColumn("exampletest",$"URL".map(kv: String => (kv.split("=")(0), kv.split("=")(1))) )
.withColumn("Search_OriginLongitude", split($"URL","\\?"))
.withColumn("Search_OriginLongitude2", split($"Search_OriginLongitude"(1),"&"))

//   .map(kv: Any => (kv.split("=")(0), kv.split("=")(1)))
//   .toMap
//   .get("originlongitude"))

display(result)

Desired Result:

 --- -------------------- -------------------- -------------------- 
|KEY|                 URL|     originlatitude |    originlongitude |
 --- -------------------- -------------------- -------------------- 
|  1|https://www.myweb...| 35.0133612060147   | -116.156211232302  |
|  2|https://www.myweb...| 19.9141319141121   | -56.1241881401291  |
 --- -------------------- -------------------- -------------------- 

CodePudding user response:

parse_url function can actually take a third parameter key for the query parameter name you want to extract, like this:

val result = df
  .withColumn("Search_OriginLongitude", expr("parse_url(URL, 'QUERY', 'originlatitude')"))
  .withColumn("Search_OriginLongitude2", expr("parse_url(URL, 'QUERY', 'originlongitude')"))

result.show
// --- -------------------- ---------------------- ----------------------- 
//|key|                 URL|Search_OriginLongitude|Search_OriginLongitude2|
// --- -------------------- ---------------------- ----------------------- 
//|  1|https://www.myweb...|      35.0133612060147|      -116.156211232302|
//|  2|https://www.myweb...|      19.9141319141121|      -56.1241881401291|
// --- -------------------- ---------------------- ----------------------- 

Or you can use str_to_map function to create a map of parameter->value like this:

val result = df
  .withColumn("URL", expr("str_to_map(split(URL,'[?]')[1],'&','=')"))
  .withColumn("Search_OriginLongitude", col("URL").getItem("originlatitude"))
  .withColumn("Search_OriginLongitude2", col("URL").getItem("originlongitude"))
  • Related