Match text both inside and outside html tags, with grouping

Time:08-02

I'd like to process the following string with a regex so that I get one match per segment, whether or not the segment is wrapped in an html tag:

zero<pre>one</pre>two three<span>four</span>five
  • I also care about the content of the html tag.

Expected result (I write the group number as x, because I'm not sure which group number each capture would end up with):

match 1: zero
match 2: <pre>one</pre>, group x: one -- and other groups having pre tags
match 3: two three
match 4: <span>four</span>, group x: four -- and other groups having span tags
match 5: five

What I tried (live demo):

((<(.*?)>)?(.*?)(<\/(.*?)>))?

Or differently (live demo):

(<(.*?)>)?(?:.?)

Neither works. I think I just need to handle the untagged text at the beginning and end (zero and five in the above example), but I can't get it right.

CodePudding user response:

I would replace the .*? everywhere with what you are really looking for.

  • When finding the tagname: [^>]+
  • When finding text in tags: [^<]+
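
To see why the lazy .*? causes trouble, here is a minimal Python sketch that runs the first attempt from the question against the sample string. Python's re module is an assumption (the question does not name a regex flavor), and the helper name nonempty_matches is made up for illustration.

```python
import re

SAMPLE = "zero<pre>one</pre>two three<span>four</span>five"

# First attempt from the question, unchanged.
FIRST_ATTEMPT = r"((<(.*?)>)?(.*?)(<\/(.*?)>))?"


def nonempty_matches(pattern, text):
    """Return the full text of every non-empty match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]


if __name__ == "__main__":
    for match in nonempty_matches(FIRST_ATTEMPT, SAMPLE):
        print(match)
    # prints:
    # zero<pre>one</pre>
    # two three<span>four</span>
```

Because . also matches <, the lazy (.*?) keeps growing until it reaches a closing tag, so zero is swallowed into the first match. The trailing five is never matched, since that pattern requires a closing tag and none follows it. A negated class cannot cross a tag boundary, which is what avoids both problems.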

The regular expression could be this:

((<([^>]+)>)?([^<]+)(<\/([^>]+)>)?)?

Regex101 playground:
https://regex101.com/r/eXT7YR/1
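
As a rough check outside regex101, the pattern can be tried in Python. This is only a sketch under assumptions: it assumes each character class carries a + quantifier, it uses Python's re module (the thread does not name a flavor), and the function name split_segments is hypothetical. Since the whole pattern is optional it can also match the empty string, so empty matches are skipped.

```python
import re

SAMPLE = "zero<pre>one</pre>two three<span>four</span>five"

# Assumed pattern: every character class is followed by a `+` quantifier.
# Groups: 1 = whole segment, 3 = opening tag name, 4 = text, 6 = closing tag name.
SEGMENT = re.compile(r"((<([^>]+)>)?([^<]+)(<\/([^>]+)>)?)?")


def split_segments(text):
    """Return (full_match, tag_name_or_None, inner_text) per segment."""
    segments = []
    for m in SEGMENT.finditer(text):
        if not m.group(0):
            # The outer group is optional, so empty matches occur; skip them.
            continue
        segments.append((m.group(0), m.group(3), m.group(4)))
    return segments


if __name__ == "__main__":
    for segment in split_segments(SAMPLE):
        print(segment)
    # prints:
    # ('zero', None, 'zero')
    # ('<pre>one</pre>', 'pre', 'one')
    # ('two three', None, 'two three')
    # ('<span>four</span>', 'span', 'four')
    # ('five', None, 'five')
```

In this numbering, the group x from the question is group 4 (the text) and the tag name is group 3. Note that the pattern does not check that the closing tag name equals the opening one, and it is not meant for nested tags.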
