Replacing HTML content in a string

I have a string that contains HTML, such as:

string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";

What I need in the end is:

string myMessage = "Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given";

I can do this by replacing each substring with myMessage = myMessage.Replace("string to replace", ""); but then I have to list every substring and replace it with an empty string. Is there a better solution?

CodePudding user response：

If I understand you correctly, you have a larger text with multiple occurrences of "<a ....>" and you want to replace each of them with just the URL given in its href.

Not sure if this makes it much easier for you, but you could use Regex.Matches, e.g.:

var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";

// Regex is in System.Text.RegularExpressions, StringBuilder in System.Text
var matches = Regex.Matches(myMessage, "(.+?)<a.+?href=\"(.+?)\".+?<\\/a>(.+?)");

var strBuilder = new StringBuilder();
foreach (Match match in matches)
{
    var groups = match.Groups;

    strBuilder.Append(groups[1]) // Please the website for more information (
        .Append(groups[2]) // http://www.africau.edu/images/default/sample.pdf
        .Append(groups[3]); // )
}
    
Debug.Log(strBuilder.ToString());

So what does this do?

(.+?) creates a group for everything before the first occurrence of the following <a => groups[1]
(<a.+?href=") matches everything starting with <a and ending with href=" => ignored
(.+?) creates a group for everything between href=" and the next " (so the URL) => groups[2]
(".+?<\/a>) matches everything from that " up to the next </a> => ignored
(.+?) creates a group for what follows the </a>; because it is lazy and nothing comes after it in the pattern, it captures only a single character (here the closing parenthesis) => groups[3]

and groups[0] is the entire match.

So finally we just want to combine

groups[1] + groups[2] + groups[3]

but in a loop, so that multiple matches within the same string are handled; a StringBuilder is simply more efficient for that.

Result

Please the website for more information (http://www.africau.edu/images/default/sample.pdf)

You can easily adjust this to, e.g., also remove the ( ) or include the text between the tags, but I figured this makes the most sense for now.
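One possible adjustment, sketched here under some assumptions: instead of collecting groups in a loop, Regex.Replace can substitute every anchor in place, keeping both the URL and the link text. The class name AnchorFlattener and the method name Flatten are made up for this illustration, and the pattern assumes the href value is wrapped in double quotes, as in the sample string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorFlattener
{
    // Group 1: the href value, group 2: the text between <a ...> and </a>.
    // [^>]* skips any other attributes, including the malformed target=_blank".
    private static readonly Regex Anchor = new Regex(
        "<a\\b[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: replaces each anchor with "URL linkText".
    public static string Flatten(string html)
    {
        return Anchor.Replace(html, m => m.Groups[1].Value + " " + m.Groups[2].Value.Trim());
    }

    public static void Main()
    {
        var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
        Console.WriteLine(Flatten(myMessage));
    }
}
```

Text outside the anchors is left untouched, so the surrounding parentheses stay in place; they could be stripped afterwards if they are not wanted.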

CodePudding user response：

I personally don't like to rely on the string format always being what I expect as this can lead to errors down the road.

Instead, I offer two ways I can think of doing this:

Use regular expressions:

string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var capturePattern = @"(.+)\(<a .*href.*?=""(.*?)"".*>(.*)</a>\)";
var regex = new Regex(capturePattern);
var captures = regex.Match(myMessage);

var newString = $"{captures.Groups[1]}{captures.Groups[2]}{captures.Groups[3]}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);

Output:

Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)

Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given

Of course, regular expressions are only as good as the cases you can think of and test. I wrote this up quickly just to illustrate, so make sure to verify it against other string variations.

The other way is using HtmlAgilityPack:

string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var doc = new HtmlDocument();
doc.LoadHtml(myMessage);
var prefix = doc.DocumentNode.ChildNodes[0].InnerText;
// .First() requires "using System.Linq;"
var url = doc.DocumentNode.SelectNodes("//a[@href]").First().GetAttributeValue("href", string.Empty);
var suffix = doc.DocumentNode.ChildNodes[1].InnerText + doc.DocumentNode.ChildNodes[2].InnerText;

var newString = $"{prefix}{url}{suffix}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);

Output:

Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)

Please the website for more information (http://www.africau.edu/images/default/sample.pdf easy details given)

Notice that this method preserves the parentheses around the link. This is because, from the Agility Pack's perspective, the opening parenthesis is part of the text node before the link. You can always remove them with a quick replace.

This method adds a dependency but this library is very mature and has been around for a long time.
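The snippet above addresses the child nodes by index, which only works when the string has exactly this shape. As a rough sketch of a variant that does not depend on node positions, each anchor can be swapped for a text node and the document's InnerText read back afterwards. This assumes the HtmlAgilityPack NuGet package is referenced; the class name HapAnchorFlattener and the method name Flatten are invented for this example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HapAnchorFlattener
{
    // Hypothetical helper: replaces every <a href="..."> element with "URL linkText".
    public static string Flatten(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return html;
        }

        // Copy to a list first, because the tree is modified while iterating.
        foreach (var anchor in anchors.ToList())
        {
            var url = anchor.GetAttributeValue("href", string.Empty);
            var replacement = doc.CreateTextNode(url + " " + anchor.InnerText.Trim());
            anchor.ParentNode.ReplaceChild(replacement, anchor);
        }

        return doc.DocumentNode.InnerText;
    }

    public static void Main()
    {
        var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
        Console.WriteLine(Flatten(myMessage));
    }
}
```

As with the indexed version, the parentheses are ordinary text to the parser and therefore remain in the result.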

It goes without saying that, for both methods, you should add error-handling checks for unexpected conditions.
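What such checks might look like for the regex route, as a minimal sketch: test Match.Success before reading any groups, and guard against null or empty input. The names SafeLinkExtractor and TryGetFirstHref are hypothetical and only serve this illustration. For the HtmlAgilityPack route, the analogous check is that SelectNodes returns null when no node matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeLinkExtractor
{
    // Hypothetical helper: reports failure through the return value instead of
    // reading groups from an unsuccessful match.
    public static bool TryGetFirstHref(string html, out string url)
    {
        url = string.Empty;
        if (string.IsNullOrEmpty(html))
        {
            return false;
        }

        Match match = Regex.Match(html, "<a\\b[^>]*?href=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            return false; // no anchor with a double-quoted href found
        }

        url = match.Groups[1].Value;
        return true;
    }

    public static void Main()
    {
        string url;
        if (TryGetFirstHref("See <a href=\"http://www.africau.edu/images/default/sample.pdf\">this</a>", out url))
        {
            Console.WriteLine(url);
        }
        else
        {
            Console.WriteLine("No link found");
        }
    }
}
```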

CodePudding user response：

You may find your solution here https://stackoverflow.com/a/1121515/8071163