I want to download an HTML document from a website and then edit it in C# using a regular expression. My goal is to get everything in the document between the article tags.
The HTML below is only an excerpt; the full document contains many article tags.
So far I have had no success. Maybe you have an idea.
My attempt so far:
<article(.*)>(<("[^"]*"|'[^']*'|[^'">])*>|(\n)|)*<\/article>
The HTML document looks like this:
<div></div>
<div >
</div>
<div >
<article id="post-5555">
<div >
<div >
<div >
<div rel="category tag"><a>Allgemein</a></div><header>
<a ><h3>Some Text</h3></a><div >
<span >
25. Juli 2022</span>
<span >
<span >by</span><span itemprop="author">
<a href="https://some link" rel="author">
Some text</a>
</span>
</span>
</div>
</header>
<div >
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>
<div></div>
CodePudding user response:
Assuming your HTML is actually valid XHTML, you could parse it using the XDocument class:
using System.Linq;
using System.Xml.Linq;

string html = "<div></div>" +
"<div class=\"entry-content\">" +
"</div>" +
"<div class=\"kt_archivecontent \" > " +
"<article id=\"post-5555\">" +
"<div class=\"row\">" +
"<div class=\"col-md-12 \">" +
"<div class=\"post-text-inner\">" +
"<div class=\"/\" rel=\"category tag\"><a>Allgemein</a></div><header>" +
"<a ><h3>Some Text</h3></a><div class=\"post-top-meta kt_color_gray\">" +
"<span class=\"postdate kt-post-date updated\">" +
"25. Juli 2022</span>" +
"<span class=\"postauthortop kt-post-author author vcard\">" +
"<span class=\"kt-by-author\">by</span><span itemprop=\"author\">" +
"<a href=\"https://some link\" class=\"fn kt_color_gray\" rel=\"author\">" +
"Some text</a>" +
"</span>" +
"</span> " +
"</div>" +
"</header>" +
"<div class=\"entry-content\">" +
"<p>some Text<a>Read More</a></p>" +
"</div>" +
"<footer>" +
"</footer>" +
"</div>" +
"</div>" +
"</div>" +
"</article></div>" +
"<div></div>";
// Wrap the fragment in a single root element so that it parses as one XML document.
XDocument doc = XDocument.Parse("<root>" + html + "</root>");
// First <article> element, or null if the document contains none.
XElement article = doc.Descendants("article").FirstOrDefault();
string s = article?.ToString();
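Since the question says the real page contains many article tags, the same idea can be extended to loop over all of them and keep only what sits between the tags. This is a minimal sketch under the same valid-XHTML assumption; the class name ArticleExtractor, the method GetArticleContents, and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ArticleExtractor
{
    // Hypothetical helper: returns the inner markup of every <article> element.
    // Assumes the input is well-formed XML once wrapped in a single root element.
    public static List<string> GetArticleContents(string html)
    {
        XDocument doc = XDocument.Parse("<root>" + html + "</root>");
        return doc.Descendants("article")
                  // Concatenate the child nodes so the <article> tag itself is excluded.
                  .Select(a => string.Concat(
                      a.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting))))
                  .ToList();
    }

    public static void Main()
    {
        // Made-up sample input with two articles.
        string html = "<div><article id=\"post-1\"><p>first</p></article>" +
                      "<article id=\"post-2\"><p>second</p></article></div>";

        foreach (string content in GetArticleContents(html))
        {
            Console.WriteLine(content);
        }
    }
}
```

If the page is not well-formed XML, XDocument.Parse throws an XmlException, so this only works for markup that satisfies the assumption above.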
CodePudding user response:
As others have said, parsing HTML with regular expressions comes with a health warning because it can go wrong in a number of ways, particularly when there are comments in the HTML or nested tags. That said, if you are just knocking up something scrappy and know the restrictions on the way the HTML was generated, it can be an easy and quick way to go.
I suspect the issue you ran into is that the dot '.' character does not match newlines by default. To make it do so, you need to use the RegexOptions.Singleline option. Also, to stop it treating multiple articles as one capture, you need to make the match non-greedy (lazy) by adding a '?' after the '*'.
Something like this:
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var testString = @"<div></div>
<div class=""entry-content"">
</div>
<div class=""kt_archivecontent "" >
<article id=""post-5555"">
<div class=""row"">
<div class=""col-md-12 "">
<div class=""post-text-inner"">
<div class=""/"" rel=""category tag""><a>Allgemein</a></div><header>
<a ><h3>Some Text</h3></a><div class=""post-top-meta kt_color_gray"">
<span class=""postdate kt-post-date updated"">
25. Juli 2022</span>
<span class=""postauthortop kt-post-author author vcard"">
<span class=""kt-by-author"">by</span><span itemprop=""author"">
<a href=""https://some link"" class=""fn kt_color_gray"" rel=""author"">
Some text</a>
</span>
</span>
</div>
</header>
<div class=""entry-content"">
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>
<article id=""post-5556"">
<div class=""row"">
<div class=""col-md-12 "">
<div class=""post-text-inner"">
<div class=""/"" rel=""category tag""><a>Allgemein</a></div><header>
<a ><h3>Some Text</h3></a><div class=""post-top-meta kt_color_gray"">
<span class=""postdate kt-post-date updated"">
25. Juli 2022</span>
<span class=""postauthortop kt-post-author author vcard"">
<span class=""kt-by-author"">by</span><span itemprop=""author"">
<a href=""https://some link"" class=""fn kt_color_gray"" rel=""author"">
Some text</a>
</span>
</span>
</div>
</header>
<div class=""entry-content"">
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>
<div></div>";
            // Singleline lets '.' match newlines; the lazy '.*?' keeps each article in its own match.
            var matches = Regex.Matches(testString, "<article(.*?)>.*?<\\/article>", RegexOptions.Singleline);
            foreach (var match in matches)
            {
                Console.WriteLine($"{match}\n{new String('=', 80)}");
            }
        }
    }
}
Another way to apply the Singleline option is to prepend the inline option (?s) to the regular expression like this:
var matches = Regex.Matches(testString, "(?s)<article(.*?)>.*?<\\/article>");
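These patterns return each article including its opening and closing tags. If only the content between the tags is wanted, as the question asks, one option is to wrap the middle part in a named group. The following is a rough sketch, not the only way to do it; the class name ArticleBodies, the method Extract, the group name body, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ArticleBodies
{
    // Hypothetical helper: returns the text between <article ...> and </article>
    // for every article found in the input.
    public static List<string> Extract(string input)
    {
        // (?s) lets '.' match newlines; (?<body>...) names the part between the tags.
        return Regex.Matches(input, "(?s)<article(.*?)>(?<body>.*?)<\\/article>")
                    .Cast<Match>()
                    .Select(m => m.Groups["body"].Value)
                    .ToList();
    }

    public static void Main()
    {
        // Made-up sample input with two articles, one spanning several lines.
        string input = "<article id=\"a\"><p>one</p></article>\n" +
                       "<article id=\"b\">\n<p>two</p>\n</article>";

        foreach (string body in Extract(input))
        {
            Console.WriteLine(body);
        }
    }
}
```

The same health warning applies: an article nested inside another article would end the match at the first closing tag.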
If you want to be helpful to any future developers who maintain the code, you can tell the pattern to ignore whitespace and add comments like this:
var matches = Regex.Matches(
testString,
@"(?sx) # Options single line (s) and ignore pattern whitespace (x)
< # Match the opening of a tag
article # Match the name of the article tag
(.*?) # Match the attributes of the article tag if there are any
> # Match the end of the opening article tag
.*? # Match the minimum number of characters required to create a match
<\/article> # Match the closing </article> tag");
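To see why both the Singleline option and the lazy quantifier matter, it can help to count the matches each variant produces on a small input. This is only an illustrative sketch; the class name GreedyVersusLazy, the helper CountMatches, and the two-article sample string are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Hypothetical helper: counts the matches of a pattern under the given options.
    public static int CountMatches(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        // Made-up input: two articles, each spanning more than one line.
        string input = "<article id=\"a\">\none\n</article>\n" +
                       "<article id=\"b\">\ntwo\n</article>";

        // Greedy: '.*' runs to the last </article>, so both articles collapse into one match.
        Console.WriteLine(CountMatches(input, "<article.*>.*</article>", RegexOptions.Singleline));   // 1

        // Lazy: '.*?' stops at the first </article>, giving one match per article.
        Console.WriteLine(CountMatches(input, "<article.*?>.*?</article>", RegexOptions.Singleline)); // 2

        // Without Singleline, '.' cannot cross the line breaks, so nothing matches.
        Console.WriteLine(CountMatches(input, "<article.*?>.*?</article>", RegexOptions.None));       // 0
    }
}
```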