I have a big document with the following rigid structure:
<h1>Title 1</h1>
Article text
<h1>Title 2</h1>
Article text
<h1>Title 3</h1>
Article text
My aim is to create a list of lists, each pairing a title with the article text that follows it, up to the next title.
I tried:
var parts = Regex.Split(html2, @"(<h1>)").Where(l => l !=string.Empty).ToArray().Select(a => Regex.Split(a, @"(</h1>)")).ToArray();
But the result is not as expected. Any ideas how to split the separate articles and the titles? Thanks!
CodePudding user response:
Parsing HTML with Regex is a bad idea, as described by this classic answer: https://stackoverflow.com/a/1732454/173322
If you had a specific tag to extract then it could maybe work but since you have multiple tags and freeform article text in between, I suggest you use a parsing engine like AngleSharp or HtmlAgilityPack to handle processing. It'll be faster and more reliable.
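As a rough illustration of the parser route, here is a minimal sketch using HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced and that the titles and the article text are siblings at the top level of the fragment, as in the question. The names `ArticleParser`, `SplitArticles` and `ParserDemo` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ArticleParser
{
    // Hypothetical helper: returns one [title, article] pair per <h1>.
    public static List<List<string>> SplitArticles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h1");
        if (headings == null)
        {
            return result;
        }

        foreach (HtmlNode h1 in headings)
        {
            var article = new StringBuilder();

            // Collect every sibling after this <h1> up to the next <h1>.
            for (HtmlNode node = h1.NextSibling;
                 node != null && node.Name != "h1";
                 node = node.NextSibling)
            {
                article.Append(node.OuterHtml);
            }

            result.Add(new List<string> { h1.InnerText.Trim(), article.ToString().Trim() });
        }

        return result;
    }
}

public static class ParserDemo
{
    public static void Main()
    {
        string html = "<h1>Title 1</h1>Article text<h1>Title 2</h1>Article text";

        foreach (List<string> pair in ArticleParser.SplitArticles(html))
        {
            Console.WriteLine("Title: " + pair[0]);
            Console.WriteLine("Article: " + pair[1]);
        }
    }
}
```

Walking the siblings keeps any markup inside the article (paragraphs, links and so on) intact, which a plain string split cannot guarantee. If the headings are nested inside other elements, the sibling walk would need adjusting.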
If you must stick with manual text parsing though, I would simply loop through each line, check if it starts with an <h1> tag, classify the lines as Titles or Article Text, then loop through again to strip out the tags from the Titles and pair them with the Article Text.
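The line-by-line idea could look roughly like the sketch below. It assumes every title sits on its own line in the form `<h1>...</h1>`, and it folds the two loops into a single pass. Any text before the first title is dropped. The names `LineSplitter`, `SplitByLines` and `LineDemo` are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LineSplitter
{
    // Hypothetical helper: assumes each <h1>...</h1> title is on its own line.
    public static List<List<string>> SplitByLines(string text)
    {
        var result = new List<List<string>>();
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        foreach (string raw in lines)
        {
            string line = raw.Trim();

            if (line.StartsWith("<h1>", StringComparison.Ordinal) &&
                line.EndsWith("</h1>", StringComparison.Ordinal))
            {
                // Title line: strip "<h1>" (4 chars) and "</h1>" (5 chars),
                // then start a new [title, article] pair.
                string title = line.Substring(4, line.Length - 9).Trim();
                result.Add(new List<string> { title, "" });
            }
            else if (result.Count > 0 && line.Length > 0)
            {
                // Article line: append it to the most recent title's article text.
                List<string> current = result[result.Count - 1];
                current[1] = current[1].Length == 0 ? line : current[1] + "\n" + line;
            }
        }

        return result;
    }
}

public static class LineDemo
{
    public static void Main()
    {
        string text = "<h1>Title 1</h1>\nArticle text\n<h1>Title 2</h1>\nArticle text";

        foreach (List<string> pair in LineSplitter.SplitByLines(text))
        {
            Console.WriteLine("Title: " + pair[0]);
            Console.WriteLine("Article: " + pair[1]);
        }
    }
}
```

This only works while the one-title-per-line assumption holds. A title that shares a line with article text, or an `<h1>` with attributes, would need a real parser.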
CodePudding user response:
As mentioned in the comment, you should use an HTML parser. But if you want to give it a try with plain code, you could split the string, determine whether each piece of split text is a title or an article, and then add the result to a list.
However, for this task you have to:
- Split the string by a multi-character delimiter, that is, by the "<h1>" string. Credit: the answer to "string.split - by multiple character delimiter".
- If I understand your question correctly, you want a list of lists; you could use the code shown in this answer.
NOTE: This code assumes the string (i.e. your document's content) has equal amounts of titles and articles.
Here's the code I've made - hosted on dotnetfiddle.com as well:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        // Variables:
        string sample = "<h1>Title 1</h1>" + "Article text" + "<h1>Title 2</h1>" + "Article text" + "<h1>Title 3</h1>" + "Article text";

        // string.split - by multiple character delimiter
        // Credit: https://stackoverflow.com/a/1254596/12511801
        string[] arr = sample.Split(new string[] { "</h1>" }, StringSplitOptions.None);

        // I store the "title" and "article" in separate lists - their content will be unified later:
        List<string> titles = new List<string>();
        List<string> articles = new List<string>();

        // Loop over the text split by "</h1>":
        foreach (string s in arr)
        {
            // Split each piece once by "<h1>" and reuse the result.
            string[] pieces = s.Split(new string[] { "<h1>" }, StringSplitOptions.None);

            if (s.StartsWith("<h1>"))
            {
                // Piece starts with a title: position 1 is the title text.
                titles.Add(pieces[1]);
            }
            else if (s.Contains("<h1>"))
            {
                // Position 0 is the article and position 1 is the title:
                articles.Add(pieces[0]);
                titles.Add(pieces[1]);
            }
            else
            {
                // No "<h1>" in this piece - it's an article by default.
                articles.Add(pieces[0]);
            }
        }

        // ------------
        // Create a list of lists.
        // Credit: https://stackoverflow.com/a/12628275/12511801
        List<List<string>> myList = new List<List<string>>();
        for (int i = 0; i < titles.Count; i++)
        {
            myList.Add(new List<string> { "Title: " + titles[i], "Article: " + articles[i] });
        }

        // Print the results:
        foreach (List<string> subList in myList)
        {
            foreach (string item in subList)
            {
                Console.WriteLine(item);
            }
        }
    }
}
Result:
Title: Title 1
Article: Article text
Title: Title 2
Article: Article text
Title: Title 3
Article: Article text