Regex pattern to find all the digits that are not immediately followed by a dot character

Can any of you please help me to write a regex pattern for the below requirement?

Match section tags that don't contain a number.
Match section tags whose number is not followed by a dot character.
Only the number closest to the section tag (the one at the start of its content) should be considered.

Test String:

<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>
<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>

Pattern:

<section[a-z]>[\d]*[^\.]*<\/section[a-z]

Regex Pattern Should Match the below string:

<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>

CodePudding user response：

You can use this regex:

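/<section[a-z]>([^\d]|\d+(?![.])).*?<\/section[a-z]>/g

Explanation:

<section[a-z]> - match literal <section and a letter and >

( - begin a group

[^\d]|\d+ - match either a non-digit OR one or more digits

(?![.]) - NOT followed by a dot .

) - end group

.*?<\/section[a-z]> - match any character zero or more times (as few as possible) followed by the literal string </section followed by a letter and >

This will not match if a single leading digit is followed by a dot. Note that with a multi-digit number such as 12. the one-or-more-digits part can backtrack to 1, which is followed by 2 rather than a dot, so such a section would still match; using (?![.\d]) as the lookahead avoids that.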
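To see how this behaves on the test string from the question, here is a minimal sketch in Python (the thread does not name a language, so Python is an assumption, as are the names PATTERN and TEST_STRING). It assumes the quantifier is `\d+`, as "one or more digits" in the explanation implies, and that the pattern ends with the closing `>`.

```python
import re

# Pattern from the answer above, assuming `\d+` and a closing `>`.
PATTERN = re.compile(r"<section[a-z]>([^\d]|\d+(?![.])).*?<\/section[a-z]>")

# Test string from the question, one section per line.
TEST_STRING = """\
<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>
<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>
"""

# finditer + group(0) yields whole matches; findall would return
# only the contents of the capture group.
matches = [m.group(0) for m in PATTERN.finditer(TEST_STRING)]

for match in matches:
    print(match)
```

On this input the sketch prints the "2 Surface Model" and "Compound Interest" sections only. The "3. 2. 1" section is not matched, because its first digit is followed by a dot.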

CodePudding user response：

This matches the updated requirements:

<section\w+>(((\d+\.\s*)*(\d+[^\.]))|[^\d]).*?<\/section\w>

<section\w+> \w is mostly the same as [a-z], with + to allow for one or more letters (<sectiona>, <sectionabc>); remove the + for exactly one letter

(\d+\.\s*)* zero or more repetitions of digits, a dot, and any number of spaces - matches the updated row 3, where it's now 3. 2. 1 with spaces after the dots

(\d+[^\.]) must match one or more digits not followed by a dot

((...)|[^\d]) or the section does not start with a digit (matches row 5)

.*? followed by any character, as few times as possible, up to the following </section - this could likely be done with a lookahead to simplify the regex but, for me, this keeps the "no digits" clause separate.
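The pattern can be checked against the question's test string with a short Python sketch. Python is an assumption here, since the thread does not name a language, and the names SECTION_PATTERN and SAMPLE are made up for illustration. The sketch assumes the quantifiers lost in the pattern as displayed are `+` (so `\w+` and `\d+`).

```python
import re

# Pattern from this answer, assuming the quantifiers are `+`.
SECTION_PATTERN = re.compile(
    r"<section\w+>(((\d+\.\s*)*(\d+[^\.]))|[^\d]).*?<\/section\w>"
)

# Test string from the question, one section per line.
SAMPLE = """\
<sectionb>2.3. Optimized test sentence<op>(</op>1,1<cp>)</cp></sectionb>
<sectiona>2 Surface Model: ONGV<op>(</op>1,1<cp>)</cp></sectiona>
<sectiona>3. Verification of MKJU<op>(</op>1,1<cp>)</cp> Entity</sectiona>
<sectionc>3. 2. 1 <txt>Case 1</txt> Annual charges to SGX</sectionc>
<sectiona>Compound Interest<role>back</role></sectiona>
"""

# The pattern has several capture groups, so take group(0)
# to get the full text of each match.
section_matches = [m.group(0) for m in SECTION_PATTERN.finditer(SAMPLE)]

for section in section_matches:
    print(section)
```

With this input the three expected sections are printed: "2 Surface Model", "3. 2. 1 Case 1" and "Compound Interest". The two sections whose last leading number ends in a dot ("2.3." and "3.") are skipped.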
