Extract domain from URLs using Scala

I am trying to extract domains from URLs.

Input:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()

Expected results:

    +--------------------------------+---------------+
    | raw_url                        | host          |
    +--------------------------------+---------------+
    | subdomain.example.com/test.php | example.com   |
    | example.com                    | example.com   |
    | example.buzz                   | example.buzz  |
    | test.example.buzz              | example.buzz  |
    | subdomain.example.co.uk        | example.co.uk |
    +--------------------------------+---------------+

Any advice much appreciated.

EDIT: based on the tip from @AlexOtt I have got a few steps closer.

    import com.google.common.net.InternetDomainName
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

However, I clearly have not implemented it correctly with withColumn. Here is the error:

    error: not found: value topPrivateDomain
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

EDIT 2:

Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import com.google.common.net.InternetDomainName
    import java.net.URL

    val b = Seq(
       ("subdomain.example.com/test.php"),
       ("example.com"),
       //("example.buzz"),
       //("test.example.buzz"),
       ("subdomain.example.co.uk"),
       ).toDF("raw_url")

    val hostExtractUdf = org.apache.spark.sql.functions.udf { 
        (urlString: String) =>
        val url = new URL("https://" + urlString)
        val host = url.getHost
        InternetDomainName.from(host).topPrivateDomain().name()
    }

    var c = b.select("raw_url").withColumn("HOST", 
       hostExtractUdf(col("raw_url")))
        .show(false)

However, it still does not work as expected. Newer suffixes such as .buzz, .site, and .today cause the following error:

Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz

CodePudding user response：

First, add Guava to the dependencies in build.sbt:

libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"

Now you can extract the host as follows,

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName

import java.net.URL

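// Outside spark-shell, toDF also needs: import spark.implicits._
// (this assumes a SparkSession named `spark` is in scope)

// define the extraction udf
val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
  // prepend a scheme so java.net.URL can parse the bare host/path
  val url = new URL("https://" + urlString)
  val host = url.getHost
  // reduce the host to its registrable domain, e.g. example.co.uk
  InternetDomainName.from(host).topPrivateDomain().toString
}

val b = Seq(
  ("subdomain.example.com/test.php"),
  ("example.com"),
  ("example.buzz"),
  ("test.example.buzz"),
  ("subdomain.example.co.uk"),
).toDF("raw_url")

val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))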
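
Note that topPrivateDomain() throws IllegalStateException when the host is not under any suffix known to the Guava version that is actually loaded at runtime, and an exception inside a UDF fails the whole job. One possible cause for newer TLDs, stated here as an assumption and not confirmed in this thread, is an older Guava on the Spark classpath taking precedence over the version declared in build.sbt. The following is a minimal sketch of a defensive wrapper: it falls back to the plain host when the resolver throws. SafeDomain, hostOf, and domainOrHost are hypothetical names, and the resolver is passed in as a function so that the block needs only the JDK.

```scala
import java.net.URI
import scala.util.Try

object SafeDomain {
  // Extract the host from a raw string that may or may not carry a scheme.
  // Returns None when the input cannot be parsed or has no host.
  def hostOf(raw: String): Option[String] = {
    val trimmed = raw.trim
    val withScheme = if (trimmed.contains("://")) trimmed else "https://" + trimmed
    Try(Option(new URI(withScheme).getHost)).toOption.flatten.filter(_.nonEmpty)
  }

  // Apply `resolve` to the host; if it throws (for example because the
  // suffix is unknown to the resolver), keep the unmodified host instead.
  def domainOrHost(raw: String, resolve: String => String): Option[String] =
    hostOf(raw).map(host => Try(resolve(host)).getOrElse(host))
}
```

Inside the UDF, the resolver would be `h => InternetDomainName.from(h).topPrivateDomain().toString`, so rows that Guava cannot resolve keep their full host instead of aborting the job.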

CodePudding user response：

You can use regular expressions with the regexp_extract and regexp_replace functions, but this does not handle multi-part suffixes such as co.uk (see the last row of the output). Here's an example:

val c = b.withColumn(
  "HOST",
  regexp_extract(col("raw_url"), "^(?:https?:\\/\\/)?(?:[^@\n]+@)?(?:www.)?([^:\\/\\n?]+)", 1)
).withColumn(
  "HOST",
  regexp_replace(col("HOST"), "^.+\\.([^.]+\\.[^.]+)", "$1")
)

c.show(false)
//+------------------------------+------------+
//|raw_url                       |HOST        |
//+------------------------------+------------+
//|subdomain.example.com/test.php|example.com |
//|example.com                   |example.com |
//|example.buzz                  |example.buzz|
//|test.example.buzz             |example.buzz|
//|subdomain.example.co.uk       |co.uk       |
//+------------------------------+------------+
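
To experiment with the two patterns without a Spark session, the same two steps can be sketched in plain Scala. This assumes the Spark functions behave like java.util.regex for these patterns. RegexHost and extract are hypothetical names used only for illustration.

```scala
object RegexHost {
  // Step 1: capture the host, skipping an optional scheme, user info and "www."
  private val hostPattern =
    "^(?:https?:\\/\\/)?(?:[^@\\n]+@)?(?:www.)?([^:\\/\\n?]+)".r

  // Step 2: keep only the last two dot-separated labels of the host
  private val lastTwoLabels = "^.+\\.([^.]+\\.[^.]+)"

  def extract(raw: String): String = {
    val host = hostPattern.findFirstMatchIn(raw).map(_.group(1)).getOrElse("")
    host.replaceAll(lastTwoLabels, "$1")
  }
}
```

Because the second step always keeps exactly two labels, `RegexHost.extract("subdomain.example.co.uk")` yields `co.uk`, matching the last row of the table.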

You may want to take a look at this answer to improve the regex.
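
As one possible direction, and not necessarily what the linked answer proposes, the co.uk case can be handled by checking the last two labels against a set of known multi-part suffixes before deciding how many labels to keep. The sketch below is illustrative only: RegistrableDomain, fromHost, and the three-entry suffix set are assumptions, and a real deployment would need a far larger list such as the Public Suffix List.

```scala
object RegistrableDomain {
  // Illustrative, hard-coded sample; NOT a complete list of suffixes.
  val multiLabelSuffixes: Set[String] = Set("co.uk", "org.uk", "com.au")

  // Keep three labels when the host ends in a known two-label suffix,
  // otherwise keep two. Expects a bare host, not a full URL.
  def fromHost(host: String): String = {
    val labels = host.toLowerCase.split('.').filter(_.nonEmpty)
    val lastTwo = labels.takeRight(2).mkString(".")
    val keep = if (multiLabelSuffixes.contains(lastTwo)) 3 else 2
    labels.takeRight(keep).mkString(".")
  }
}
```

The function could be wrapped in a UDF and applied to the HOST column produced by regexp_extract, replacing the regexp_replace step.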