Extracting body from HTML with Haskell

Haskell beginner over here!

I'm trying to parse an HTML String and extract the body from it. I'm using GHC version 9.0.2. I tried to extract it with a regex, using Text.Regex.TDFA (version 1.3.1.2). I've checked that the regex works at regex101.com. Based on this question, I've modified it to conform to POSIX Extended Regular Expressions. But for some reason my code (line 51) is still unable to match the body of the HTML.

So my question is: why exactly is this happening, and how can I fix it? Or is there a better or easier way to extract the HTML body?

Thank you all in advance.

CodePudding user response：

Please do not capture HTML with a regex. HTML is a context-free language ^[wiki], whereas a regex can (often) only parse regular languages ^[wiki] and thus cannot capture HTML. Even if it can be done with a regex for a (very) specific problem, the result is a cumbersome regex that is hard to write, validate, and debug.

Haskell has a library named scalpel ^[hackage], which is quite effective at parsing HTML. For example, you can extract the HTML inside the <body> tag with:

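{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel (innerHTML, scrapeStringLike)

-- Inner HTML of the first <body> element, or Nothing if there is none.
extractBody :: String -> Maybe String
extractBody myHtml = scrapeStringLike myHtml (innerHTML "body")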

Here myHtml is the string that contains the HTML of the page. You will likely want to do more advanced scraping, and scalpel lets you define a hierarchy of scrapers that each perform a small task to build an advanced parser.
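As a rough illustration of such a hierarchy, here is a minimal sketch that combines two small scrapers into a larger one. The PageInfo type, the scraper names, and sampleHtml are made up for this example; only text, attrs, chroot, and scrapeStringLike come from scalpel.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel (Scraper, attrs, chroot, scrapeStringLike, text)

-- Hypothetical result type for this illustration: page title plus link targets.
data PageInfo = PageInfo
  { pageTitle :: String
  , bodyLinks :: [String]
  } deriving (Show, Eq)

-- Small scraper: the text of the <title> element.
titleScraper :: Scraper String String
titleScraper = text "title"

-- Small scraper: the href of every <a> element, restricted to <body> via chroot.
linksScraper :: Scraper String [String]
linksScraper = chroot "body" (attrs "href" "a")

-- Larger scraper built from the two smaller ones.
pageInfoScraper :: Scraper String PageInfo
pageInfoScraper = PageInfo <$> titleScraper <*> linksScraper

-- Made-up input document.
sampleHtml :: String
sampleHtml =
  "<html><head><title>Demo</title></head>\
  \<body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>"

-- Runs the combined scraper; Nothing means one of the parts did not match.
scrapePageInfo :: String -> Maybe PageInfo
scrapePageInfo html = scrapeStringLike html pageInfoScraper
```

In GHCi, scrapePageInfo sampleHtml should give back the title together with both link targets. Each small scraper can be tried on its own with scrapeStringLike before it is combined with the others.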

CodePudding user response：

You're almost there.

<body.*> is too greedy.
([\w|\W]) does not need a pipe inside a character set, and it is missing a quantifier.
<\/body> is fine.

You need this:

<body.*?>([\w\W]*)<\/body>

https://regex101.com/r/9rVCUQ/1
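One caveat for the Haskell side: the pattern above was checked on regex101, and the lazy quantifier .*? and the \w / \W shorthands are PCRE-style features that POSIX Extended Regular Expressions do not define, so I would not rely on them in Text.Regex.TDFA. Below is a minimal sketch of a POSIX-style equivalent. It assumes that the multiline compile option is on by default (which keeps . and [^...] from matching newlines) and therefore turns it off; the names bodyPattern, bodyRegex, and extractBodyRegex are made up for this example.

```haskell
import Text.Regex.TDFA

-- POSIX ERE variant of the pattern: [^>]* stands in for the lazy .*?,
-- and a plain .* stands in for [\w\W]*.
bodyPattern :: String
bodyPattern = "<body[^>]*>(.*)</body>"

-- Assumption: multiline defaults to True, so it is switched off here
-- to let '.' and [^>] match across line breaks.
bodyRegex :: Regex
bodyRegex =
  makeRegexOpts (defaultCompOpt { multiline = False }) defaultExecOpt bodyPattern

-- Text between the opening <body ...> tag and the last </body>, if any.
extractBodyRegex :: String -> Maybe String
extractBodyRegex html =
  case match bodyRegex html :: (String, String, String, [String]) of
    (_, _, _, [inner]) -> Just inner   -- exactly one capture group matched
    _                  -> Nothing      -- no match
```

POSIX matching is leftmost-longest, so the capture runs up to the last </body> in the input. That is fine for a page with a single body, but it is one more reason to prefer a real HTML parser.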

Everyone is going to want to tell you that you shouldn't parse or extract HTML using regex. Use an HTML parsing library for more reliable results.