Simplifying a Regex expression used to find highlighted words with HTML tags

Time:10-16

Could you explain the following regex to me? Why does it have to be so complicated? All it has to do is check whether an em tag is present in a text, e.g. <em>text</em>, <pre>whatever</pre> or something like that; basically, highlighted words. https://regex101.com/ explains what it does in detail.

Can it be simplified?

[Fact]
public async Task Handle_ShouldReturnHighlightedDescriptions_WhenGivenEmptyInput()
{
    // Arrange
    var query = new GetProductsQuery();

    // Act
    var actual = await _queryHandler.Handle(query, default);

    // Assert
    actual.Products.Should().AllSatisfy(p =>
        p.Description.Should().NotMatchRegex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>"));
}

CodePudding user response:

Depends on what the purpose is, but parsing HTML quickly becomes complicated.

If you only want to check whether an HTML string contains any tags, then just check for the pattern "<\w", which matches the start of any opening tag. But if the HTML might also contain comments (or CDATA sections), then you have to eliminate false positives like <!-- <em> -->, which is possible but requires a significantly more complex pattern.
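As a rough illustration of the two checks described above, here is a minimal C# sketch. The class and method names (HtmlTagCheck, ContainsTag, ContainsTagOutsideComments) are hypothetical and not part of the question's code. Instead of one more complex pattern, the second method takes a different route: it removes comments and CDATA sections first and then applies the simple "<\w" check.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HtmlTagCheck
{
    // "<" followed by a word character: the start of an opening tag such as <em> or <pre>.
    private static readonly Regex TagStart = new Regex(@"<\w");

    // HTML comments and CDATA sections; Singleline lets "." span line breaks.
    private static readonly Regex CommentOrCData =
        new Regex(@"<!--.*?-->|<!\[CDATA\[.*?\]\]>", RegexOptions.Singleline);

    // Simple check: true if the text contains the start of any opening tag.
    public static bool ContainsTag(string html) => TagStart.IsMatch(html);

    // Same check, but comments and CDATA sections are stripped first,
    // so a tag that only appears inside <!-- ... --> is not reported.
    public static bool ContainsTagOutsideComments(string html) =>
        TagStart.IsMatch(CommentOrCData.Replace(html, string.Empty));
}
```

In the test from the question, the simple variant would amount to p.Description.Should().NotMatchRegex(@"<\w"). Whether the comment handling is needed depends on what the product descriptions can actually contain.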

But the pattern in your example seems intended not only to detect tags, but also to match a whole element from start tag to end tag. Unfortunately, this is even more complicated, since the pattern does not take nested elements like <em>..<em>..</em>..</em> into consideration. If you need to support this, you cannot do it reliably with a plain regex, since standard regular expressions do not support matching recursive patterns (.NET's balancing groups can emulate it, but the resulting pattern is hard to read and maintain).
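To make the nesting problem concrete, the sketch below runs the element pattern against a nested input. It assumes the capture group in the question is meant to read ([^ >]+), i.e. one or more characters that are neither a space nor ">". The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class NestedElementDemo
{
    // Element pattern from the question, assuming "+" after the character class.
    private static readonly Regex Element =
        new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

    // Returns the text of the first match, or an empty string if nothing matches.
    public static string FirstElement(string html) => Element.Match(html).Value;
}

// NestedElementDemo.FirstElement("<em>a<em>b</em>c</em>")
// returns "<em>a<em>b</em>": the lazy ".*?" stops at the first closing tag,
// so the outer element is cut short and its real end tag is left unmatched.
```

For a test that only asserts that no element is present, this does not matter, because any match at all fails the assertion. It only becomes a problem when the matched text itself is used.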
