I am trying to extract domains from URLs.
Input:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST")))
c.show()
Expected results:
+--------------------------------+---------------+
| raw_url                        | host          |
+--------------------------------+---------------+
| subdomain.example.com/test.php | example.com   |
| example.com                    | example.com   |
| example.buzz                   | example.buzz  |
| test.example.buzz              | example.buzz  |
| subdomain.example.co.uk        | example.co.uk |
+--------------------------------+---------------+
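Note that parse_url seems to need a scheme: without one it returns null for HOST, and even with a scheme prepended it keeps the subdomain, so it still would not match the expected output. A minimal sketch of the scheme-prepending variant:
val withHost = b.withColumn("host", callUDF("parse_url", concat(lit("https://"), $"raw_url"), lit("HOST")))
withHost.show(false) // e.g. "subdomain.example.com", the subdomain is still there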
Any advice much appreciated.
EDIT: Based on the tip from @AlexOtt, I have got a few steps closer.
import com.google.common.net.InternetDomainName
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
However, I clearly have not implemented it correctly with withColumn. Here is the error:
error: not found: value topPrivateDomain
       var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
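The error comes from passing topPrivateDomain as a bare identifier; more fundamentally, callUDF can only invoke a function that has been registered with the SparkSession by name, not an arbitrary library method such as InternetDomainName.from. A sketch of that registration route, with a hypothetical UDF name top_private_domain:
// Register a named UDF that wraps the Guava call, then call it by name.
spark.udf.register("top_private_domain", (host: String) => InternetDomainName.from(host).topPrivateDomain().toString)
// callUDF("top_private_domain", $"raw_url") would then resolve, although rows such as
// "subdomain.example.com/test.php" still need the path stripped from the host first.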
EDIT 2:
Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
//("example.buzz"),
//("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val hostExtractUdf = org.apache.spark.sql.functions.udf {
(urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().name()
}
val c = b.select("raw_url").withColumn("HOST", hostExtractUdf(col("raw_url")))
c.show(false)
However, it still does not work as expected. Newer suffixes like .buzz, .site, and .today cause the following error:
Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
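This likely means the public suffix list bundled with the Guava version that actually gets loaded (Spark ships an older Guava of its own) predates TLDs such as .buzz. A possible guard, as a sketch that assumes Guava's isUnderPublicSuffix is available and simply falls back to the bare host when the suffix is unknown:
val safeHostExtractUdf = udf { (urlString: String) =>
  val host = new URL("https://" + urlString).getHost
  val idn = InternetDomainName.from(host)
  // Fall back to the full host when Guava does not recognise the suffix.
  if (idn.isUnderPublicSuffix()) idn.topPrivateDomain().toString else host
}
b.withColumn("HOST", safeHostExtractUdf(col("raw_url"))).show(false)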
CodePudding user response:
First you will need to add guava to the dependencies in build.sbt.
libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
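If you run through spark-shell or spark-submit rather than a packaged build, the same artifact can be pulled in on the command line, for example:
spark-shell --packages com.google.guava:guava:31.0.1-jre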
Now you can extract the host as follows:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
// define the extraction udf
val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().toString
}
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))
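Calling show on the result then prints the extracted hosts; assuming the Guava version above is the one actually picked up at runtime, its public suffix data should also cover newer TLDs such as .buzz.
c.show(false)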
CodePudding user response:
You can use some regex with the regexp_extract and regexp_replace functions, but I'm not sure it can handle all cases like co.uk. Here's an example:
val c = b.withColumn(
  "HOST",
  regexp_extract(col("raw_url"), "^(?:https?:\\/\\/)?(?:[^@\\n]+@)?(?:www\\.)?([^:\\/\\n?]+)", 1)
).withColumn(
  "HOST",
  regexp_replace(col("HOST"), "^.+\\.([^.]+\\.[^.]+)$", "$1")
)
c.show(false)
//+------------------------------+------------+
//|raw_url                       |HOST        |
//+------------------------------+------------+
//|subdomain.example.com/test.php|example.com |
//|example.com                   |example.com |
//|example.buzz                  |example.buzz|
//|test.example.buzz             |example.buzz|
//|subdomain.example.co.uk       |co.uk       |
//+------------------------------+------------+
You may want to take a look at this answer to improve the regex.
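If only a handful of two-part suffixes matter, one option is to special-case them before falling back to the generic last-two-labels rule. This is just a sketch: the secondLevel list below is illustrative and far from complete, and a robust solution would need the full public suffix list.
// Illustrative, incomplete list of second-level suffixes to special-case.
val secondLevel = Seq("co.uk", "org.uk", "gov.uk", "com.au").map(_.replace(".", "\\.")).mkString("|")
val d = b.withColumn(
  "HOST",
  regexp_extract(col("raw_url"), "^(?:https?:\\/\\/)?(?:[^@\\n]+@)?(?:www\\.)?([^:\\/\\n?]+)", 1)
).withColumn(
  "HOST",
  when(col("HOST").rlike(s"[^.]+\\.(?:$secondLevel)$$"),
    // keep three labels for hosts ending in a known two-part suffix
    regexp_extract(col("HOST"), s"([^.]+\\.(?:$secondLevel))$$", 1))
    .otherwise(regexp_replace(col("HOST"), "^.+\\.([^.]+\\.[^.]+)$", "$1"))
)
d.show(false)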