Kotlin, how can I read a dynamic website as text?

Time:07-05

As the title says, I'm trying to read the content of sites like this one, which appears to be JavaScript-based.

I tried the plain JDK library, then jsoup, and then HtmlUnit, but I couldn't get anything useful out of any of them (I get only the page source, only the title, or null):

val url = URL("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
val connection = url.openConnection()
val scanner = Scanner(connection.getInputStream())
scanner.useDelimiter("\\Z")
val content = scanner.next()
scanner.close()
println(content)

val doc = Jsoup.connect("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate").get()
println(doc.text())

WebClient().use { webClient ->
    val page = webClient.getPage<HtmlPage>("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
    val pageAsText = page.asNormalizedText()
    println(pageAsText)
}

WebClient(BrowserVersion.FIREFOX).use { webClient ->
    val page = webClient.getPage<HtmlPage>("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
    println(page.textContent)
}

It should be something easy peasy, but I can't see what's wrong.

CodePudding user response:

In order for this to be possible, you need something to execute the JS that modifies the DOM.

It might be a bit overkill depending on the use case, and probably won't be possible if you're on Android, but one way to do this is to launch a headless browser separately and interact with it from your code. For instance, using Chrome Headless and the Chrome DevTools Protocol. If you're interested, I have written a Kotlin library called chrome-devtools-kotlin to interact with a Chrome browser in a type-safe way.
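If a full DevTools session is more than you need, a lighter variant of the same idea is to start headless Chrome once per page and let it print the DOM after the scripts have run. Here is a minimal sketch using only the JDK (Java 9+), under a few assumptions: `buildDumpDomCommand` and `fetchRenderedHtml` are hypothetical helper names, the Chrome executable name varies by system (for example `google-chrome`, `chromium`, or a full path), and the time budget may need tuning for a given site.

```kotlin
import java.io.File
import java.io.IOException
import java.util.concurrent.TimeUnit

// Builds the command line for a one-shot headless Chrome run that prints the
// DOM to stdout. `chromeBinary` is an assumption: pass whatever name or path
// your installation uses.
fun buildDumpDomCommand(chromeBinary: String, url: String, budgetMs: Long = 10_000): List<String> =
    listOf(
        chromeBinary,
        "--headless",
        "--disable-gpu",
        // Gives the page's scripts some virtual time to finish before the dump.
        "--virtual-time-budget=$budgetMs",
        "--dump-dom",
        url
    )

// Runs Chrome and returns the rendered HTML, or null if the binary cannot be
// started, the run times out, or Chrome exits with a non-zero status.
fun fetchRenderedHtml(chromeBinary: String, url: String, timeoutSeconds: Long = 60): String? {
    // stdout goes to a temp file so a hanging browser cannot block the read.
    val output = File.createTempFile("rendered-", ".html")
    try {
        val process = ProcessBuilder(buildDumpDomCommand(chromeBinary, url))
            .redirectOutput(output)
            .redirectError(ProcessBuilder.Redirect.DISCARD)
            .start()
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly()
            return null
        }
        return if (process.exitValue() == 0) output.readText() else null
    } catch (e: IOException) {
        return null
    } finally {
        output.delete()
    }
}
```

Calling `fetchRenderedHtml("google-chrome", url)` should give you the markup as the browser sees it after rendering, which you can then hand to `Jsoup.parse(html).text()` if you only want the text.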

There might be simpler options, though. For instance, you may be able to run an embedded browser with JBrowserDriver instead and still use jsoup to parse the resulting HTML, as mentioned in this other answer.

CodePudding user response:

Regarding HtmlUnit:

The page initially has no content; everything you see is rendered by JavaScript on the client side using one of these SPA frameworks. It looks like a feature check at the start determines that the JS support in HtmlUnit lacks some of the required features, and as a result you only get a hint like "Please enable Javascript to use this application".

You can use

page.asXml()

to have a look at the content through HtmlUnit's eyes.
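If that output is long, a small helper can pull out any "enable JavaScript" notice so you can quickly tell whether the page fell back to its no-script shell. This is only a heuristic sketch: `findJavaScriptHints` is a hypothetical function (not part of HtmlUnit), and the phrase it searches for is an assumption based on the hint quoted above.

```kotlin
// Returns every text run in the markup that mentions "enable JavaScript"
// (case-insensitive). An empty list means no such notice was found; it does
// not prove that the page rendered fully.
fun findJavaScriptHints(markup: String): List<String> {
    // Match text between tags that contains the phrase, without crossing '<' or '>'.
    val pattern = Regex("(?i)[^<>]*enable\\s+javascript[^<>]*")
    return pattern.findAll(markup).map { it.value.trim() }.toList()
}
```

You would call it as `findJavaScriptHints(page.asXml())` and print the result.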

You can open an HtmlUnit issue on GitHub, but I fear adding support for this will be a longer story.
