Can anyone help me write a regex pattern for the requirements below?
- Match section tags that don't have numbers.
- Match section tags whose number is not followed by a dot character.
- Only the numbers right next to the section tag should be considered.
Test String:
<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>
*<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>*
<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>
*<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>*
*<sectiona>Compound Interest<role>back</role></sectiona>*
Pattern:
<section[a-z]>[\d]*[^\.]*<\/section[a-z]
The regex pattern should match the strings below:
<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>
CodePudding user response:
You can use this regex:
/<section[a-z]>([^\d]|\d+(?![.])).*?<\/section[a-z]>/g
Explanation:
<section[a-z]>
- match the literal <section followed by one letter and >
(
- begin a group
[^\d]|\d+
- match either a non-digit OR one or more digits
(?![.])
- NOT followed by a dot .
)
- end group
.*?<\/section[a-z]>
- match any characters, as few as possible, followed by the literal string </section, one letter and >
This will not match if the section text starts with a single digit that is followed by a dot.
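As a quick check, the pattern above can be run against the test strings from the question. This is only a minimal sketch: the question does not name a language, so Python is an assumption, the pattern is written with the `+` quantifier and the closing `>` spelled out, and the names PATTERN, TEST_STRINGS and results are made up for illustration.

```python
import re

# The pattern from this answer as a Python raw string
# (assumes \d+ for "one or more digits" and a closing > on the end tag).
PATTERN = re.compile(r"<section[a-z]>([^\d]|\d+(?![.])).*?</section[a-z]>")

# The five test strings from the question, without the emphasis asterisks.
TEST_STRINGS = [
    "<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>",
    "<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>",
    "<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>",
    "<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>",
    "<sectiona>Compound Interest<role>back</role></sectiona>",
]

# True where the pattern finds a match in the row, False otherwise.
results = [bool(PATTERN.search(s)) for s in TEST_STRINGS]

for s, matched in zip(TEST_STRINGS, results):
    print("match   " if matched else "no match", s)
```

Under these assumptions rows 2 and 5 match, while row 4 (3. 2. 1) does not, because only the first number is checked. A multi-digit number such as 12. would still match, since \d+ can backtrack to 1, which is followed by 2 rather than a dot.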
CodePudding user response:
This matches the updated requirements:
<section\w*>(((\d+\.\s*)*(\d+[^\.]))|[^\d]).*?<\/section\w*>
<section\w*>
\w is mostly the same as [a-z]; the * allows for 0 or more letters (<section>, <sectionabc>). Remove the * for exactly one letter.
(\d+\.\s*)*
0 or more groups of digits, a dot, and any number of spaces. This matches the updated row 3, which is now 3. 2. 1 with spaces after the dots.
(\d+[^\.])
must match one or more digits that are not followed by a dot
((...)|[^\d])
or the section does not start with a digit (matches row 5)
.*?
followed by any characters, as few as possible, up to the following </section
A lookahead could likely simplify the regex, but, for me, this keeps the "no digits" clause separate.
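To see which rows this pattern picks out, here is a small sketch in Python (the language is an assumption, as the question does not name one). It assumes the quantifiers are `\d+` and `\w*`, as the explanation describes, and the names PATTERN, TEST_STRINGS and matched_rows are hypothetical.

```python
import re

# The pattern from this answer as a Python raw string
# (assumes \w* for "0 or more" letters and \d+ for "one or more digits").
PATTERN = re.compile(
    r"<section\w*>(((\d+\.\s*)*(\d+[^\.]))|[^\d]).*?</section\w*>"
)

# The five test strings from the question, without the emphasis asterisks.
TEST_STRINGS = [
    "<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>",
    "<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>",
    "<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>",
    "<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>",
    "<sectiona>Compound Interest<role>back</role></sectiona>",
]

# 1-based numbers of the test rows in which the pattern finds a match.
matched_rows = [
    i for i, s in enumerate(TEST_STRINGS, start=1) if PATTERN.search(s)
]
print(matched_rows)  # prints [2, 4, 5]
```

Rows 1 and 3 are rejected because the last number in their leading numbering is followed by a dot. As with any `\d+` followed by a "not a dot" check, a multi-digit number such as 12. can still slip through by backtracking to a shorter run of digits.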