I have gotten as far as fetching the HTML of the page with the following code:
#!/usr/bin/env racket
#lang racket/base
(require net/url racket/port)
(require (planet neil/html-parsing:3:0))
(define p (get-pure-port (string->url "https://www.rosettacode.org/wiki/Web_scraping")))
(define my-html (port->string p))
(close-input-port p)
How do I get the title, i.e. the text inside the <title> tag, from my-html?
CodePudding user response:
I prefer working with XML tooling over HTML tooling or, in Racket (and Scheme in general), SXML. That lets you use XPath-like queries to easily extract data from the document. Luckily, it's simple to parse HTML into an SXML expression:
#!/usr/bin/env racket
#lang racket
(require html-parsing)
(require sxml/sxpath)
(define my-html "<!doctype html><html><head><title>Title text here</title></head><body><p>a paragraph of text</p></body></html>")
(define document (html->xexp my-html))
; Returns a list of strings
(display-lines ((sxpath "/html/head/title/text()") document))
or, in your case:
(require net/url) ; string->url, get-pure-port and call/input-url are not part of #lang racket
(define document (call/input-url (string->url "https://www.rosettacode.org/wiki/Web_scraping")
                                 get-pure-port html->xexp))
(html->xexp takes either a string holding an HTML document or an input port.)
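To illustrate the two input forms, here is a minimal sketch that parses the same markup once from a string and once from an input port. It assumes the html-parsing package is installed; the sample markup is made up for the example.

```racket
#lang racket
(require html-parsing)

;; Made-up sample document, for illustration only.
(define sample "<html><head><title>Demo</title></head><body></body></html>")

;; Parse once directly from the string ...
(define from-string (html->xexp sample))
;; ... and once from an input port wrapping the same text.
(define from-port (html->xexp (open-input-string sample)))

;; Both calls should produce the same SXML tree.
(equal? from-string from-port)
```

The port form is what makes the call/input-url version work: the port opened by get-pure-port is handed straight to html->xexp, so the page never has to be read into a string first.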
The interesting bit is sxpath, which takes an SXPath string and returns a new procedure that, when called in turn with an SXML argument, returns a list of all matches. If you're going to be looking for the same thing repeatedly, it's worth defining a new function instead of using a temporary:
(define get-title-text (sxpath "/html/head/title/text()"))
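Since a query like this always returns a list, a caller that wants a single title still has to unpack it. A small sketch of one way to do that follows; page-title is a hypothetical helper name, not part of either package, and joining the matched text nodes is my own choice in case the parser splits the title into several strings.

```racket
#lang racket
(require html-parsing sxml/sxpath)

;; Built once, reusable across documents.
(define get-title-text (sxpath "/html/head/title/text()"))

;; page-title (hypothetical helper): takes an HTML string or input port and
;; returns the title text as one string, or #f when no title text is found.
(define (page-title html)
  (define matches (get-title-text (html->xexp html)))
  (if (null? matches)
      #f
      (apply string-append matches)))

(page-title "<html><head><title>Title text here</title></head><body></body></html>")
(page-title "<html><head></head><body><p>no title</p></body></html>")
```

Returning #f for a missing title keeps the "not found" case distinct from an empty string; raising an error instead would be an equally reasonable design.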
The html-parsing and sxml packages should be installed via the DrRacket package manager or from the command line with raco pkg install html-parsing sxml, whichever you prefer.