Regular expression to capture groups up until and inclusive of a specific-length number

Time:12-14

I have a string that looks more or less like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. 

1\. Ut enim ad minim 5453 veniam

2\. quis nostrud exercitation ullamco 
14567883390
laboris nisi ut aliquip ex ea commodo consequat.
12\. Duis aute irure dolor in reprehenderit in voluptate 
velit esse cillum dolore

23432434234 
eu fugiat nulla pariatur. Excepteur sint 
occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

I would like to have a regular expression that captures groups up until and inclusive of the 11 digit number. In this case, I would like it to capture two groups:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. 

1\. Ut enim ad minim 5453 veniam

2\. quis nostrud exercitation ullamco 
14567883390

and

laboris nisi ut aliquip ex ea commodo consequat.
12\. Duis aute irure dolor in reprehenderit in voluptate 
velit esse cillum dolore

23432434234

while ignoring the remainder of the text.

I tried to play with re.findall() and positive lookbehinds such as (?<=\d{11}). I wasn't able to get what I want. What regular expression would allow me to do that?

CodePudding user response:

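You need to handle newlines (by default, . does not match \n) and make the quantifier non-greedy so that each match stops at the first 11-digit number. The following works: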

import re
s1="""
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. 

1\. Ut enim ad minim 5453 veniam

2\. quis nostrud exercitation ullamco 
14567883390
laboris nisi ut aliquip ex ea commodo consequat.
12\. Duis aute irure dolor in reprehenderit in voluptate 
velit esse cillum dolore

23432434234 
eu fugiat nulla pariatur. Excepteur sint 
occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""
# (?:.|\n)*? lazily matches any characters, newlines included, up to the next 11 digits
for e in re.finditer(r'(?:.|\n)*?\d{11}', s1):
    print(e.group())
    print('-------')  # separator between groups, as shown in the result

result:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. 

1\. Ut enim ad minim 5453 veniam

2\. quis nostrud exercitation ullamco 
14567883390
-------

laboris nisi ut aliquip ex ea commodo consequat.
12\. Duis aute irure dolor in reprehenderit in voluptate 
velit esse cillum dolore

23432434234
-------
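
An equivalent way to let the dot cross line breaks is the re.DOTALL (re.S) flag, which avoids the (?:.|\n) alternation. The following is a minimal sketch on a shortened sample; the string `sample` is made up for illustration and only mimics the shape of the text in the question.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's text.
sample = (
    "first block\n"
    "with 5453 inside\n"
    "14567883390\n"
    "second block\n"
    "\n"
    "23432434234 \n"
    "trailing text that should be ignored\n"
)

# re.S (DOTALL) lets "." match newlines; ".*?" stays lazy, so each match
# ends at the first run of 11 digits it reaches.
chunks = re.findall(r".*?\d{11}", sample, flags=re.S)

for chunk in chunks:
    print(chunk)
    print("-------")
```

The second chunk starts with the newline that follows the first number; calling .strip() on each chunk removes it if that matters.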

CodePudding user response:

(?<=\d{11}) only asserts that 11 digits sit directly to the left of the current position; it does not consume them. According to the example data, you want the 11 digits themselves to be part of the match, with the number starting its own line.
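
To make the difference concrete, here is a small sketch comparing a lookbehind-only pattern with one that consumes the digits. The string `text` is a hypothetical short sample, not the data from the question.

```python
import re

text = "abc\n14567883390\ndef"  # hypothetical short sample

# The lookbehind only checks what precedes the position; it consumes nothing,
# so every match is an empty string.
lookbehind_only = re.findall(r"(?<=\d{11})", text)

# Putting \d{11} in the consuming part of the pattern keeps the digits in the match.
consuming = re.findall(r"(?:.|\n)*?\d{11}", text)

print(lookbehind_only)
print(consuming)
```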

If the first line of a match should not itself be an 11-digit line, and a match should not consist of 11 digits alone:

^(?!(?:\d{11})?[^\S\n]*$).*(?:\n(?!\d{11}[^\S\n]*$).*)*\n\d{11}[^\S\n]*$

Explanation

  • ^ Start of a line (the pattern is used with re.M)
  • (?!(?:\d{11})?[^\S\n]*$) Assert that the line is not blank and does not consist of 11 digits only
  • .* Match the whole line
  • (?: Non-capturing group
    • \n Match a newline
    • (?!\d{11}[^\S\n]*$).* Match a line that does not consist of 11 digits followed by optional spaces
  • )* Close the non-capturing group and repeat it zero or more times
  • \n\d{11}[^\S\n]* Match a newline, 11 digits and optional whitespace other than a newline
  • $ End of the line
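
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the names `block_pattern`, `sample` and `blocks` are illustrative, and the sample string is a shortened stand-in for the real text.

```python
import re

# The same pattern as above, split across lines with re.VERBOSE.
block_pattern = re.compile(
    r"""
    ^(?!(?:\d{11})?[^\S\n]*$)     # first line: not blank and not 11 digits only
    .*                            # match that whole line
    (?:                           # then zero or more further lines ...
        \n(?!\d{11}[^\S\n]*$).*   # ... that are not an 11-digit line
    )*
    \n\d{11}[^\S\n]*$             # finally the 11-digit line, optional trailing spaces
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical shortened sample.
sample = "intro line\nmore text\n14567883390\nnext block\n23432434234 \nignored tail"
blocks = block_pattern.findall(sample)
print(blocks)
```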


import re

pattern = r"^(?!(?:\d{11})?[^\S\n]*$).*(?:\n(?!\d{11}[^\S\n]*$).*)*\n\d{11}[^\S\n]*$"

s = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. \n\n"
            "1\\. Ut enim ad minim 5453 veniam\n\n"
            "2\\. quis nostrud exercitation ullamco \n"
            "14567883390\n"
            "laboris nisi ut aliquip ex ea commodo consequat.\n"
            "12\\. Duis aute irure dolor in reprehenderit in voluptate \n"
            "velit esse cillum dolore\n\n"
            "23432434234 \n"
            "eu fugiat nulla pariatur. Excepteur sint \n"
            "occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")

print(re.findall(pattern, s, re.M))

Output

[
  'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et doloreere34545 magna aliqua. \n\n1\\. Ut enim ad minim 5453 veniam\n\n2\\. quis nostrud exercitation ullamco \n14567883390',
  'laboris nisi ut aliquip ex ea commodo consequat.\n12\\. Duis aute irure dolor in reprehenderit in voluptate \nvelit esse cillum dolore\n\n23432434234 '
]
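
One difference between the two approaches: a bare \d{11} that is not anchored to its own line also matches the first 11 digits of a longer number. If only numbers of exactly 11 digits should end a group (an assumption about the requirement, not something the question states), digit-boundary lookarounds are one option. A sketch with a made-up `sample`:

```python
import re

# Hypothetical sample: a 12-digit number first, then an 11-digit one.
sample = "a 123456789012 b\n14567883390\nrest"

# Without boundaries, the first 11 digits of the 12-digit number end a match.
loose = re.findall(r".*?\d{11}", sample, flags=re.S)

# (?<!\d) and (?!\d) require that no digit touches the 11-digit run on either side.
strict = re.findall(r".*?(?<!\d)\d{11}(?!\d)", sample, flags=re.S)

print(loose)
print(strict)
```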