Regex to match anything except HTML tags when code is encoded using &lt; and &gt;

I am trying to use regex to match any text except for HTML tags. I have found this solution for "normal" HTML code:

<[^>]*>(*SKIP)(*F)|[^<]

However, my code is encoded using &lt; and &gt; instead of < and >, and I have not been able to modify the regex above to make it work.

As an example, given the text:

Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;

I need to match "Hi" and "there, how are you". Note that I also need to match text that is not between tags, such as "Hi" in this example.

UPDATE: since I am using Ruby's gsub, it looks like I cannot even use (*SKIP) and (*F).

UPDATE 2: I was trying not to go into much detail, but it seems to be important: I actually need to replace all the spaces in a text, except those that are part of a tag, be it a < ... > tag or a &lt;...&gt; tag.

CodePudding user response:

You can use

text = text.gsub(/(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m) { $1 || '_' }

I suggest [[:blank:]] instead of \s because \s also matches line breaks, and I assume you do not want to replace those.
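To illustrate the difference, here is a minimal sketch. The sample string and the constant names are made up for this example and are not from the question:

```ruby
# Hypothetical sample: a space, a tab, and a CRLF line break.
SAMPLE = "a b\tc\r\nd"

# \s also matches \r and \n, so the line break is replaced too.
WITH_S = SAMPLE.gsub(/\s/, '_')

# [[:blank:]] matches spaces and tabs here, so the line break is left alone.
WITH_BLANK = SAMPLE.gsub(/[[:blank:]]/, '_')

p WITH_S
p WITH_BLANK
```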

The regex above matches

(&lt;.*?&gt;|<[^>]*>) - Group 1: either &lt;, then any zero or more chars, as few as possible, and then &gt; (the m flag lets . match line breaks in Ruby), or <, then zero or more chars other than >, and then a >
| - or
[[:blank:]] - any single horizontal whitespace (you may also use [\p{Zs}\t] to match any Unicode horizontal whitespace).
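As a small sketch of the [\p{Zs}\t] variant mentioned in the last item, assuming a UTF-8 string; the sample text and constant names are invented for illustration:

```ruby
# Hypothetical sample containing a no-break space (U+00A0), a tab,
# and an ideographic space (U+3000).
UNICODE_SAMPLE = "a\u00A0b\tc\u3000d"

# \p{Zs} covers Unicode space separators; \t is added explicitly
# because a tab is not in the Zs category.
UNICODE_RESULT = UNICODE_SAMPLE.gsub(/[\p{Zs}\t]/, '_')

p UNICODE_RESULT
```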

The { $1 || '_' } block in the replacement means that when Group 1 matches, the Group 1 value is returned as is, else, _ is returned as a replacement of a horizontal whitespace.
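Putting it together on the sample text from the question, a minimal sketch could look like this. The method name replace_spaces_outside_tags and the constant TAG_OR_BLANK are hypothetical names chosen for this example:

```ruby
# Group 1 captures a whole tag (entity-encoded or literal);
# the second alternative matches a single horizontal whitespace char.
TAG_OR_BLANK = /(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m

# Hypothetical helper: replaces spaces and tabs outside tags with '_'.
def replace_spaces_outside_tags(text)
  # When a tag matched, $1 is set and the tag is put back unchanged;
  # otherwise $1 is nil and the whitespace char becomes '_'.
  text.gsub(TAG_OR_BLANK) { $1 || '_' }
end

input = "Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;"
puts replace_spaces_outside_tags(input).inspect
```

The space inside the encoded p tag is kept because the whole tag is consumed by Group 1 before the [[:blank:]] alternative gets a chance to match it.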