Ignore words in some parts of the string while replacing using Java regex

Time:04-14

I am trying to detect and replace certain words in an HTML string. For example, I have a string like this:

"cool span<span class='Span'> span spanspan the span</span> what a span"

I want to replace just the word 'span' (with 'xyz', for example) in this string outside of the tags (or, basically, ignore everything between angle brackets). Expected result:

"cool xyz<span class='Span'> xyz spanspan the xyz</span> what a xyz"

I tried various regex patterns without any luck. At this point I doubt it's even possible with Java regex.

Thanks in advance for any help :)

Edit: I found the regex that solves the problem: (?<!\<)\bspan\b(?!\>). Oddly, the post that suggested it was deleted. Thanks, all, for the responses.
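For reference, here is a minimal sketch of how the pattern from the edit could be applied in Java. The class and method names (SpanRegexDemo, replaceWord) are made up for illustration, and the backslashes before < and > are dropped because they are not needed in Java.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original post.
class SpanRegexDemo {
    // Whole word "span" that is not directly preceded by '<' and not directly followed by '>'.
    private static final Pattern WORD = Pattern.compile("(?<!<)\\bspan\\b(?!>)");

    static String replaceWord(String html) {
        return WORD.matcher(html).replaceAll("xyz");
    }

    public static void main(String[] args) {
        String input = "cool span<span class='Span'> span spanspan the span</span> what a span";
        // Prints: cool xyz<span class='Span'> xyz spanspan the xyz</span> what a xyz
        System.out.println(replaceWord(input));
    }
}
```

Note that the pattern only looks at the characters directly next to the word. A 'span' inside an attribute value, as in <div class='span'>, would still be replaced, so this works for input like the sample string rather than for arbitrary HTML.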

CodePudding user response:

It is not possible with regex in general!

Regular expressions can only recognize regular languages (Type-3 grammars). Even a language as simple as a^{n}b^{n} cannot be recognized by a regex in general. Here a^{n}b^{n} means all words that consist of some number of "a"s followed by the same number of "b"s (aabb, aaabbb).

If you need to "store" information about an earlier part of the string for later parts of it, you will always run into problems with regex. In your case you need to store the information that there was a "<" before and whether it has been closed again with ">". You can read more about this here: https://en.wikipedia.org/wiki/Chomsky_hierarchy .

An exception can be made by limiting n to a finite bound for a^{n}b^{n}. For such a simple example, a large regex whose size depends on n could then validate it. It would be pretty ugly, however. If you are really into it, you can see this problem at work when validating IPv4 vs. IPv6 addresses with regex. For IPv4 addresses it is easily possible. For IPv6 you need to "store something". That is limited to 8, however, which makes it possible. The resulting regex looks ugly as hell, though; see "Regular expression that matches valid IPv6 addresses".
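To make the a^{n}b^{n} point concrete, here is a rough sketch in Java (11 or later, because of String.repeat). The class and method names are invented for this illustration. The first method keeps counters and so handles any n; the second builds a regex with one alternative per allowed n, which only works because n is bounded.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the names are not from the original answer.
class AnBnDemo {
    // Counting approach: two counters serve as the "stored" information, so any n >= 1 works.
    static boolean isAnBn(String s) {
        int i = 0;
        int as = 0;
        while (i < s.length() && s.charAt(i) == 'a') { as++; i++; }
        int bs = 0;
        while (i < s.length() && s.charAt(i) == 'b') { bs++; i++; }
        return i == s.length() && as > 0 && as == bs;
    }

    // Bounded approach: enumerate ab|aabb|aaabbb|... up to maxN (assumes maxN >= 1).
    static Pattern boundedAnBn(int maxN) {
        StringBuilder sb = new StringBuilder();
        for (int n = 1; n <= maxN; n++) {
            if (n > 1) sb.append('|');
            sb.append("a".repeat(n)).append("b".repeat(n));
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(isAnBn("aaaabbbb"));                            // true
        System.out.println(boundedAnBn(3).matcher("aaabbb").matches());    // true
        System.out.println(boundedAnBn(3).matcher("aaaabbbb").matches());  // false, n = 4 exceeds the bound
    }
}
```

The bounded pattern grows with maxN, which is the "large and ugly" regex described above.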

I hope my explanation was somewhat understandable. I never liked the theoretical part of computer science.

tl;dr: I guess you don't have any other choice than to write for loops with variables that store the current state of your iteration, the state being whether the matching string may be replaced right now or not.
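A minimal sketch of such a state-tracking loop in Java might look like this. The names TagAwareReplacer and replaceOutsideTags are hypothetical. The sketch assumes that every '<' opens a tag and the next '>' closes it, so a '>' inside an attribute value would confuse it. For brevity, the whole-word match inside each text run still uses a small regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes '<' always opens a tag and the next '>' closes it.
class TagAwareReplacer {
    static String replaceOutsideTags(String html, String word, String replacement) {
        StringBuilder out = new StringBuilder();
        StringBuilder text = new StringBuilder(); // text collected since the last tag
        boolean insideTag = false;                // the state carried through the iteration

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!insideTag && c == '<') {
                // A tag starts: flush the text run with replacements applied.
                out.append(replaceWholeWord(text.toString(), word, replacement));
                text.setLength(0);
                insideTag = true;
                out.append(c);
            } else if (insideTag) {
                out.append(c);                    // copy tag content unchanged
                if (c == '>') {
                    insideTag = false;
                }
            } else {
                text.append(c);
            }
        }
        // Flush whatever text follows the last tag.
        out.append(replaceWholeWord(text.toString(), word, replacement));
        return out.toString();
    }

    private static String replaceWholeWord(String text, String word, String replacement) {
        return text.replaceAll("\\b" + Pattern.quote(word) + "\\b",
                Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String input = "cool span<span class='Span'> span spanspan the span</span> what a span";
        // Prints: cool xyz<span class='Span'> xyz spanspan the xyz</span> what a xyz
        System.out.println(replaceOutsideTags(input, "span", "xyz"));
    }
}
```

Because everything between '<' and '>' is copied verbatim, attribute values such as class='span' are left alone, which a pattern that only checks the neighbouring characters cannot guarantee.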

CodePudding user response:

You can try this one:

.replaceAll("(?<!<)span(?!>)", "xyz")

https://regex101.com/r/SzN5hV/1
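As a quick check, the following sketch runs this pattern on the sample string from the question, next to a variant with \b word boundaries added. The class and method names are made up for the comparison.

```java
// Hypothetical comparison class; names are illustrative only.
class BoundaryComparison {
    // Pattern as given in this answer: no word boundaries.
    static String withoutBoundaries(String s) {
        return s.replaceAll("(?<!<)span(?!>)", "xyz");
    }

    // Same pattern with \b on both sides, so only the whole word "span" matches.
    static String withBoundaries(String s) {
        return s.replaceAll("(?<!<)\\bspan\\b(?!>)", "xyz");
    }

    public static void main(String[] args) {
        String input = "cool span<span class='Span'> span spanspan the span</span> what a span";
        // Prints: cool xyz<span class='Span'> xyz xyzxyz the xyz</span> what a xyz
        System.out.println(withoutBoundaries(input));
        // Prints: cool xyz<span class='Span'> xyz spanspan the xyz</span> what a xyz
        System.out.println(withBoundaries(input));
    }
}
```

Without the boundaries, "spanspan" becomes "xyzxyz", which differs from the expected result in the question. With them, the output matches it.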
