I have HTML structured like this:
<html>
<body>
...
<p >
Text
Some more text
Even more text
</p>
...
<p >
Bla
Bla
Read more at
<a href="..." >Link</a>
</p>
</body>
</html>
How can I replace all newlines between <p> tags with <br /> in C#? Other tags shouldn't be matched.
This is the regex I currently have, but it doesn't work as I expect:
/(?<=<p).*(\n).*(?=<\/p>)/gs
CodePudding user response:
The problem of replacing some text between two other pieces of text can be solved with a regex in C# in the following way:
var start = @"<p[\s>]";
var end = @"</p>";
var pattern = $@"(?s){start}.*?{end}";
var result = Regex.Replace(text, pattern, m =>
m.Value.Replace("\n", "<br />"));
Here, (?s)<p[\s>].*?</p> finds every substring that starts with <p (followed by whitespace or >) and ends at the nearest </p>, and then m => m.Value.Replace("\n", "<br />") replaces each LF character with <br /> inside the matched value.
If the line breaks can be mixed (CRLF, CR, or LF), you will need another Regex.Replace call: replace m.Value.Replace("\n", "<br />") with
Regex.Replace(m.Value, "\r\n?|\n", "<br />")
Or, if you plan to shrink consecutive line breaks into a single <br />:
Regex.Replace(m.Value, "(?:\r\n?|\n)+", "<br />")
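Putting the pieces above together, here is a minimal sketch of the whole approach as a top-level program. The helper name BreakParagraphLines, its collapse parameter, and the sample string are made up for illustration; only the two patterns come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name and signature are illustrative only):
// replaces line breaks inside <p>...</p> with "<br />" and leaves
// everything outside the paragraphs untouched.
static string BreakParagraphLines(string text, bool collapse = false)
{
    // Either one <br /> per line break, or one per run of line breaks.
    string lineBreak = collapse ? @"(?:\r\n?|\n)+" : @"\r\n?|\n";

    // (?s) lets '.' match line breaks; .*? stops at the nearest </p>.
    return Regex.Replace(text, @"(?s)<p[\s>].*?</p>",
        m => Regex.Replace(m.Value, lineBreak, "<br />"));
}

// Sample input mixing CRLF and LF, with one empty line inside the paragraph.
string sample = "<div>\nkeep\n</div>\n<p>\r\nText\n\nMore\r\n</p>";

string each = BreakParagraphLines(sample);
string collapsed = BreakParagraphLines(sample, collapse: true);

Console.WriteLine(each);
Console.WriteLine(collapsed);
```

The line breaks inside the <div> are left alone because the outer pattern only matches <p ...>...</p> blocks. Note that nested or unclosed <p> tags would still confuse this pattern, since a regex does not track HTML structure.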
CodePudding user response:
You should not parse HTML with regex. Instead, use an HTML parser library like HtmlAgilityPack:
using System;
using System.Linq;
using HtmlAgilityPack;

string html = @"
<html>
<body>
...
<p class1"" >
Text
Some more text
Even more text
</p >
...
<p class2"" >
Bla
Bla
Read more at
<a href=""..."" > Link</a>
</p>
</body>
</html>";
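// Parse the markup into a DOM instead of pattern-matching it.
HtmlDocument doc = new();
doc.LoadHtml(html);
// Select every text node nested anywhere inside a <p> element.
var allParagraphs = doc.DocumentNode.SelectNodes("//p//text()").Cast<HtmlTextNode>().ToList();
// Environment.NewLine is platform-specific, so this assumes the input uses the platform's line ending.
allParagraphs.ForEach(p => p.Text = p.Text.Replace(Environment.NewLine, "<br />"));
// Serialize the modified document back to a string.
html = doc.DocumentNode.OuterHtml;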