Regex that matches all the text between two specific strings except if it contains another specific

Time:12-31

For example, I want to match all the text between two different tags, as long as the first tag doesn't appear again in the text in between.

Say the specific strings I want to match between are "<tag 1>hello</tag 1>" and "<tag 2>hi there</tag 2>", and the specific string I don't want in between them is "<tag 1>".

So I'd want a match with this:

<tag 1>hello</tag 1>
a bunch of text that includes newlines
<tag 2>hi there</tag 2>

But not a match with this:

<tag 1>hello</tag 1>
a bunch of text that includes newlines
<tag 2>something other than hi there</tag 2>
<tag 1>something other than hello</tag 1>
a bunch of text that includes newlines
<tag 2>hi there</tag 2>

I've tried:

<tag 1>hello</tag 1>[\S\s]*?(?=<tag 1>|$)<tag 2>hi there</tag 2>

This doesn't work; it doesn't match anything.

I'll be using Python for this, so the Python regex dialect would be good.
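A likely reason the attempt never matches: the lookahead (?=<tag 1>|$) requires the next text to be "<tag 1>" or the end of the string, while the literal right after it requires "<tag 2>hi there</tag 2>" at that same position. Both cannot hold at once. A minimal sketch that checks this with Python's re module (the variable names are only illustrative):

```python
import re

text = (
    "<tag 1>hello</tag 1>\n"
    "a bunch of text that includes newlines\n"
    "<tag 2>hi there</tag 2>"
)

# The attempted pattern from the question, unchanged.
attempt = re.compile(
    r"<tag 1>hello</tag 1>[\S\s]*?(?=<tag 1>|$)<tag 2>hi there</tag 2>"
)

# The same pattern cut off right after the lookahead, for comparison.
lookahead_only = re.compile(r"<tag 1>hello</tag 1>[\S\s]*?(?=<tag 1>|$)")

# The lookahead and the following literal contradict each other.
print(attempt.search(text))  # None

# Without the trailing literal, the lazy part runs to the end of the string.
print(lookahead_only.search(text).group() == text)  # True
```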

CodePudding user response:

<tag 1>hello</tag 1>.*(\n .*|\s)*(?:(?!tag 1).)*(\n .*|\s)*.*<tag 2>hi there</tag 2>

(used with the m and g flags)

This regex:

  • (?:(?!tag 1).)* - matches any run of characters in which "tag 1" does not begin (a negative lookahead inside a non-capturing group)
  • (\n .*|\s)* - matches text across multiple lines; the * at the end of the expression allows multiple newlines between the two strings
  • <tag 1>hello</tag 1>.* ... .*<tag 2>hi there</tag 2> - matches everything between the strings <tag 1>hello</tag 1> and <tag 2>hi there</tag 2>
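To see what the (?:(?!tag 1).)* piece does on its own, here is a small sketch in Python; the sample strings are made up for illustration. Note that the dot does not match a newline unless re.DOTALL is set.

```python
import re

# The "tempered dot" from the bullet list above, in isolation.
tempered_dot = r"(?:(?!tag 1).)*"

# Consumes characters until "tag 1" would begin at the current position.
stops_before_tag = re.match(tempered_dot, "abc <tag 1> def").group()
print(repr(stops_before_tag))  # 'abc <'

# Without re.DOTALL the dot stops at the first newline.
one_line = re.match(tempered_dot, "line one\nline two").group()
print(repr(one_line))  # 'line one'

# With re.DOTALL the dot also crosses newlines.
all_lines = re.match(tempered_dot, "line one\nline two", re.DOTALL).group()
print(repr(all_lines))  # 'line one\nline two'
```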


CodePudding user response:

This worked:

<tag 1>hello</tag 1>(?:[^<]*(?:<(?!tag 1)[^<]*)*?)<tag 2>hi there</tag 2>

Thanks to bobble bubble who suggested it in a comment.
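A quick way to check this pattern against the two samples from the question, sketched with Python's re module (the variable names are only illustrative). Since [^<] also matches newlines, no extra flags should be needed.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(
    r"<tag 1>hello</tag 1>(?:[^<]*(?:<(?!tag 1)[^<]*)*?)<tag 2>hi there</tag 2>"
)

should_match = (
    "<tag 1>hello</tag 1>\n"
    "a bunch of text that includes newlines\n"
    "<tag 2>hi there</tag 2>"
)

should_not_match = (
    "<tag 1>hello</tag 1>\n"
    "a bunch of text that includes newlines\n"
    "<tag 2>something other than hi there</tag 2>\n"
    "<tag 1>something other than hello</tag 1>\n"
    "a bunch of text that includes newlines\n"
    "<tag 2>hi there</tag 2>"
)

# Every "<" in between must not be followed by "tag 1", so a second
# "<tag 1>" blocks the match.
print(pattern.search(should_match) is not None)  # True
print(pattern.search(should_not_match))          # None
```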

CodePudding user response:

<tag 1>hello<\/tag 1>(?:\n)?(?!<tag 1>)[a-zA-Z\s\n]*<tag 2>hi there<\/tag 2>

<tag 1>hello<\/tag 1>(?:\n)? matches the string '<tag 1>hello</tag 1>' and allows for a new line after it.

(?!<tag 1>) makes sure the string '<tag 1>' does not appear at that position (negative lookahead)

[a-zA-Z\s\n]* matches 0 or more letters, spaces and newlines. Since it cannot match '<', no other tag can appear in between.

<tag 2>hi there<\/tag 2> matches the string '<tag 2>hi there</tag 2>'
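One thing to keep in mind is that the class [a-zA-Z\s\n]* only allows letters and whitespace in between, so digits or punctuation stop the match. The sketch below compares that middle part with a hypothetical alternative, not taken from the answers, that accepts any character as long as "<tag 1>" does not start there. START, END and the sample strings are assumptions for illustration.

```python
import re

START = r"<tag 1>hello</tag 1>"
END = r"<tag 2>hi there</tag 2>"

# Middle part as in the answer above: letters and whitespace only.
letters_only = re.compile(START + r"[a-zA-Z\s\n]*" + END)

# Hypothetical alternative: any character, unless "<tag 1>" begins there.
tempered = re.compile(START + r"(?:(?!<tag 1>)[\s\S])*?" + END)

with_punctuation = (
    "<tag 1>hello</tag 1>\n"
    "line 1, with digits!\n"
    "<tag 2>hi there</tag 2>"
)
repeated_start = (
    "<tag 1>hello</tag 1>\n"
    "text\n"
    "<tag 1>again</tag 1>\n"
    "<tag 2>hi there</tag 2>"
)

print(letters_only.search(with_punctuation))          # None
print(tempered.search(with_punctuation) is not None)  # True
print(tempered.search(repeated_start))                # None
```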
