Notepad++ regex pattern should look for matches within pairs of XML tags

Time:01-31

Below is the sample XML string. I want to match from the GROUP tag till the end of the 1st PARENT tag which has a value. But I want to restrict the regex to match only within a pair of <GROUP> </GROUP> tags.

<GROUP NAME="One">
<PARENT/>
<OTHERTAG1/>
</GROUP>
<GROUP NAME="Two">
<PARENT/>
<OTHERTAG1/>
<OTHERTAG2/>
</GROUP>
<GROUP NAME="Three">
<SomeTag1>
<PARENT>parent1</PARENT>
</GROUP>
<GROUP NAME="Four">
<PARENT>parent2</PARENT>
<OTHERTAG3/>
</GROUP>

I tried the following regex in Notepad++:

<GROUP NAME="(.+?)">((?!GROUP).)*<PARENT>(.+?)</PARENT>

But it matches:

<GROUP NAME="One">
<PARENT/>
<OTHERTAG1/>
</GROUP>
<GROUP NAME="Two">
<PARENT/>
<OTHERTAG1/>
<OTHERTAG2/>
</GROUP>
<GROUP NAME="Three">
<SomeTag1>
<PARENT>parent1</PARENT>

Required output is:

<GROUP NAME="Three">
<SomeTag1>
<PARENT>parent1</PARENT>

and

<GROUP NAME="Four">
<PARENT>parent2</PARENT>

I am familiar with basic regex, but not with advanced regex. The objective is to replace the existing value of the PARENT tag with the value of the NAME attribute of the GROUP tag. I don't want to replace the empty PARENT tags, though. So, for example,

<GROUP NAME="Three">
<SomeTag1>
<PARENT>parent1</PARENT>

should become

<GROUP NAME="Three">
<SomeTag1>
<PARENT>Three</PARENT>

I don't want to write code for this; I am looking for a regex pattern that can be used in Notepad++.

EDIT 1:

Do not rely on the order of the tags. The only criterion is that the PARENT tag is a child of the GROUP tag. There can be any number of tags before or after the PARENT tag. I have updated my samples to show this possibility.

The regex should always match from the start of the GROUP tag till the end of the PARENT tag having value. The match should not span multiple GROUP tags.

Answer:

You were close. Use this:

<GROUP NAME="([^"]+)">\s*<PARENT>([^<]*)</PARENT>

Notes:

  • added \s* between the tags to skip over whitespace, including line breaks
  • changed the capture group to "([^"]+)", which cannot run past the closing quote and is faster than a non-greedy scan
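
To see the difference outside Notepad++, here is a minimal Python sketch. It assumes Python's re module behaves like the Notepad++ engine for these particular constructs, and that re.DOTALL stands in for the ". matches newline" checkbox. The shortened xml string is made up for illustration.

```python
import re

# Shortened, made-up sample: one empty PARENT, one PARENT with a value.
xml = '''<GROUP NAME="One">
<PARENT/>
</GROUP>
<GROUP NAME="Three">
<PARENT>parent1</PARENT>
</GROUP>'''

# Lazy dot with DOTALL: when the first GROUP has no <PARENT> value, the
# engine backtracks and lets (.+?) grow past the closing quote, all the
# way into a later GROUP tag.
lazy = re.search(
    r'<GROUP NAME="(.+?)">((?!GROUP).)*<PARENT>(.+?)</PARENT>',
    xml,
    re.DOTALL,
)

# Negated class: the NAME capture has to stop at the first double quote,
# so a match can only start at a GROUP that really contains the value.
strict = re.search(r'<GROUP NAME="([^"]+)">\s*<PARENT>([^<]*)</PARENT>', xml)

print(repr(lazy.group(1)))    # the over-long NAME capture
print(repr(strict.group(1)))  # just the NAME value
```

The lazy version starts its match at GROUP "One", which is the over-matching described in the question; the negated class starts at GROUP "Three".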

Now, to replace the PARENT tag content with the GROUP NAME, do this search & replace:

(<GROUP NAME=")([^"]+)(">\s*<PARENT>)([^<]*)
$1$2$3$2

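That is, split the match into several capture groups, then reorder or repeat the group references in the replacement. Here $2 (the NAME value) is written back in place of group 4 (the old PARENT value).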

Learn about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

UPDATE 1 with new requirement to have other tags in GROUP before PARENT:

(<GROUP NAME=")([^"]+)(">(?:(?!<GROUP [\s\S]+?<PARENT)[\s\S])*?<PARENT>)([^<]*)
$1$2$3$2
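
If you want to sanity-check the updated pattern before running it on a real file, the following Python sketch applies it to the sample from the question. It is only a check under the assumption that Python's re module treats this pattern the same way Notepad++ does; note that Python writes the replacement as \g<n> where Notepad++ uses $n.

```python
import re

# The sample XML from the question.
SAMPLE = """<GROUP NAME="One">
<PARENT/>
<OTHERTAG1/>
</GROUP>
<GROUP NAME="Two">
<PARENT/>
<OTHERTAG1/>
<OTHERTAG2/>
</GROUP>
<GROUP NAME="Three">
<SomeTag1>
<PARENT>parent1</PARENT>
</GROUP>
<GROUP NAME="Four">
<PARENT>parent2</PARENT>
<OTHERTAG3/>
</GROUP>
"""

# Same pattern as above; [\s\S] matches newlines, so no DOTALL flag is needed.
PATTERN = re.compile(
    r'(<GROUP NAME=")([^"]+)'
    r'(">(?:(?!<GROUP [\s\S]+?<PARENT)[\s\S])*?<PARENT>)'
    r'([^<]*)'
)

# Groups 1-3 are written back unchanged; group 2 (the NAME value)
# replaces group 4 (the old PARENT value).
result = PATTERN.sub(r'\g<1>\g<2>\g<3>\g<2>', SAMPLE)
print(result)
```

Groups One and Two keep their empty PARENT tags, while the PARENT values in Three and Four become the respective NAME values.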

Explanation of capture group 3:

  • ( -- capture group start
    • "> -- literal text
    • (?: -- non-capture group start
      • (?!<GROUP [\s\S]+?<PARENT) -- negative lookahead for GROUP tag that has PARENT tag
      • [\s\S] -- any single char, including newline (alternatively specify . and check the ▢ . matches newline checkbox)
    • )*? -- non-capture group end, repeat non-greedily multiple times
    • <PARENT> -- literal text
  • ) -- capture group end
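
The same breakdown can be written as a commented pattern. This sketch uses Python's re.VERBOSE flag, which ignores unescaped whitespace in the pattern, so the literal spaces are written as [ ]. The name ANNOTATED and the one-group test string are made up for illustration.

```python
import re

# Annotated form of the full search pattern (all four capture groups).
ANNOTATED = re.compile(r'''
    (<GROUP[ ]NAME=")      # group 1: start of the GROUP tag up to the quote
    ([^"]+)                # group 2: the NAME value, stops at the next quote
    (                      # group 3 start
      ">                   #   literal text that closes the GROUP tag
      (?:                  #   non-capture group start
        (?!<GROUP[ ][\s\S]+?<PARENT)   # do not step onto a GROUP tag that is followed by a PARENT tag
        [\s\S]             #   any single char, including newline
      )*?                  #   non-capture group end, repeated lazily
      <PARENT>             #   literal text
    )                      # group 3 end
    ([^<]*)                # group 4: the existing PARENT value
''', re.VERBOSE)

m = ANNOTATED.search('<GROUP NAME="Four">\n<PARENT>parent2</PARENT>\n</GROUP>')
print(m.group(2), m.group(4))
```

Group 2 is what the replacement reuses, and group 4 is the text that gets overwritten.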