Python | Regex | get numbers from the text

I have text of the form

Refer to Annex 1.1, 1.2 and 2.0 containing information etc,

or

Refer to Annex 1.0.1, 1.1.1 containing information etc,

I need to extract the numbers that follow the keyword Annex. I have tried a lookbehind regex, as below.

m = re.search(r"(?<=Annex)\s*[\d+.\d+,]+", text)

print(m)
>>> <re.Match object; span=(11, 15), match=' 1.1'>

I get just 1.1 as output, but not the remaining numbers. How do I get all the numbers that follow the keyword Annex?

CodePudding user response:

You can use the following two-step solution:

import re
texts = [
    'Refer to Annex 1.1, 1.2 and 2.0 containing information etc,',
    'Refer to Annex 1.0.1, 1.1.1 containing information etc,',
]
# Step 1: capture the whole list of numbers that follows "Annex"
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
for text in texts:
    match = rx.search(text)
    if match:
        # Step 2: pull each dot-separated number out of the captured list
        print(re.findall(r'\d+(?:\.\d+)*', match.group(1)))

The output is

['1.1', '1.2', '2.0']
['1.0.1', '1.1.1']

The Annex\s*(\d+(?:(?:\W|and)+\d+)*) regex matches

Annex - the literal string Annex
\s* - zero or more whitespace characters
(\d+(?:(?:\W|and)+\d+)*) - Group 1: one or more digits, then zero or more repetitions of a separator (one or more non-word characters or "and" strings) followed by one or more digits.
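To make the breakdown above concrete, here is a minimal sketch that runs only the first step and prints what Group 1 captures for the first sample sentence. The variable name captured is only for illustration.

```python
import re

# Step 1 only: capture the whole number list that follows "Annex".
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')

text = 'Refer to Annex 1.1, 1.2 and 2.0 containing information etc,'
match = rx.search(text)

# Group 1 holds the raw list, separators included; it stops before
# " containing" because no digit follows that separator.
captured = match.group(1) if match else None
print(repr(captured))  # '1.1, 1.2 and 2.0'
```

Note that \W accepts any non-word character as a separator, so a digit that follows punctuation right after the list (for example "Annex 1.1. 5 items") would presumably be captured as well.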

Then, when the match is found, all dot-separated digit sequences are extracted with \d+(?:\.\d+)*.
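If the text may mention Annex more than once, the two steps can be wrapped in a small helper. This is only a sketch: the function name extract_annex_numbers and the constants ANNEX_RX and NUMBER_RX are hypothetical, and it assumes every Annex mention should contribute to one flat list.

```python
import re

# Hypothetical names, chosen for this sketch only.
ANNEX_RX = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
NUMBER_RX = re.compile(r'\d+(?:\.\d+)*')


def extract_annex_numbers(text):
    """Return every dot-separated number listed after each 'Annex' in text."""
    numbers = []
    # finditer visits every Annex mention, not just the first one
    for match in ANNEX_RX.finditer(text):
        numbers.extend(NUMBER_RX.findall(match.group(1)))
    return numbers


print(extract_annex_numbers('Refer to Annex 1.1, 1.2 and 2.0 containing information etc,'))
print(extract_annex_numbers('Refer to Annex 1.0.1, 1.1.1 containing information etc,'))
print(extract_annex_numbers('No annex here'))
```

The last call returns an empty list, since the pattern is case-sensitive and the text contains no capitalized Annex followed by a number.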