How to replace all newlines between two HTML tags with Regex in C#

Time:11-11

I have HTML code structured like this:

<html>
   <body>
      ...
      <p >
       Text
       Some more text
       Even more text
      </p>
      ...
      <p >
       Bla
       Bla
       Read more at
      <a href="..." >Link</a>
      </p>
   </body>
</html>

How can I replace all newlines between <p> tags with <br /> in C#? All other tags shouldn't match.

That's the Regex I currently have. But it doesn't work as I expect it to.

/(?<=<p).*(\n).*(?=<\/p>)/gs

CodePudding user response:

The problem of replacing some text between two other pieces of text can be solved with a regex in C# in the following way:

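// Requires: using System.Text.RegularExpressions;
var start = @"<p[\s>]";                  // "<p" followed by whitespace or ">"
var end = @"</p>";
var pattern = $@"(?s){start}.*?{end}";   // (?s) lets "." match line breaks too
var result = Regex.Replace(text, pattern, m =>
    m.Value.Replace("\n", "<br />"));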

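Here, (?s)<p[\s>].*?</p> matches every substring from <p (followed by whitespace or >) through the nearest </p>; (?s) lets . match line breaks and .*? keeps each match as short as possible. The match evaluator m => m.Value.Replace("\n", "<br />") then replaces each LF character inside the matched text with <br />.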
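For reference, here is a minimal self-contained sketch of the snippet above as a console program. The class name ParagraphBreaks, the method name BreakParagraphLines, and the sample input are made up for illustration; only the pattern and the Regex.Replace call with a match evaluator come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphBreaks
{
    // Hypothetical helper name; it only wraps the pattern from the answer.
    public static string BreakParagraphLines(string text)
    {
        // (?s) lets "." match line breaks; ".*?" stops at the first </p>.
        const string pattern = @"(?s)<p[\s>].*?</p>";
        return Regex.Replace(text, pattern, m => m.Value.Replace("\n", "<br />"));
    }

    public static void Main()
    {
        // Sample input invented for this sketch: one <p> and one <pre>.
        string input = "<div>\n<p >\nText\nMore\n</p>\n<pre>\nkeep\n</pre>\n</div>";
        Console.WriteLine(BreakParagraphLines(input));
    }
}
```

Only the line breaks inside the <p> element change; the <pre> element is left alone because <pr does not satisfy <p[\s>]. Note that the line breaks directly after the opening tag and directly before </p> are replaced as well, so the paragraph starts and ends with a <br />.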

If the input may contain mixed line endings (CRLF, CR, or LF), replace m.Value.Replace("\n", "<br />") with a nested Regex.Replace call:

Regex.Replace(m.Value, "\r\n?|\n", "<br />")

Or, if you plan to shrink consecutive line breaks into a single <br />:

Regex.Replace(m.Value, "(?:\r\n?|\n)+", "<br />")
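To make the difference between the two line-break patterns concrete, here is a small sketch that runs both on a string with mixed line endings. The class and method names are hypothetical, and the collapsing variant is assumed to use a + quantifier on the alternation group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakPatterns
{
    // Replaces each CRLF, lone CR, or lone LF with one <br />.
    public static string EachBreak(string s) =>
        Regex.Replace(s, @"\r\n?|\n", "<br />");

    // Collapses a run of consecutive line breaks into a single <br />.
    public static string CollapsedBreaks(string s) =>
        Regex.Replace(s, @"(?:\r\n?|\n)+", "<br />");

    public static void Main()
    {
        // Invented sample: CRLF, then two LFs in a row, then a lone CR.
        string mixed = "a\r\nb\n\nc\rd";
        Console.WriteLine(EachBreak(mixed));       // a<br />b<br /><br />c<br />d
        Console.WriteLine(CollapsedBreaks(mixed)); // a<br />b<br />c<br />d
    }
}
```

Either method can be called on m.Value inside the match evaluator in place of the plain string Replace.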

CodePudding user response:

You should not parse HTML with regex. Instead, use an HTML parser library such as HtmlAgilityPack:

string html = @"
<html>
    <body>
      ...
        <p class=""class1"" >
            Text
                Some more text
            Even more text
        </p >
        ...
        <p class=""class2"" >
            Bla
            Bla
            Read more at
            <a href=""..."" > Link</a>
        </p>
    </body>
</html>";

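// Requires the HtmlAgilityPack NuGet package, plus:
// using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new();
doc.LoadHtml(html);
// Select every text node inside a <p>; SelectNodes returns null when nothing matches.
var textNodes = doc.DocumentNode.SelectNodes("//p//text()");
if (textNodes != null)
{
    foreach (var node in textNodes.Cast<HtmlTextNode>())
    {
        // Normalize CRLF to LF first so the result does not depend on the platform's Environment.NewLine.
        node.Text = node.Text.Replace("\r\n", "\n").Replace("\n", "<br />");
    }
}
html = doc.DocumentNode.OuterHtml;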