I have text of the form
Refer to Annex 1.1, 1.2 and 2.0 containing information etc,
or
Refer to Annex 1.0.1, 1.1.1 containing information etc,
I need to extract the numbers that the Annex is referring to.
I have tried a lookbehind regex, as below:
m = re.search("(?<=Annex)\s*[\d+.\d+,]+", text)
print(m)
>>> <re.Match object; span=(11, 15), match=' 1.1'>
The match is just 1.1; the remaining numbers are missing. How do I get all the numbers that follow the keyword Annex?
CodePudding user response:
You can use the following two-step solution:
import re

texts = ['Refer to Annex 1.1, 1.2 and 2.0 containing information etc,',
         'Refer to Annex 1.0.1, 1.1.1 containing information etc,']
# Step 1: capture the whole run of numbers that follows "Annex"
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
for text in texts:
    match = rx.search(text)
    if match:
        # Step 2: extract each dot-separated number from the captured run
        print(re.findall(r'\d+(?:\.\d+)*', match.group(1)))
The output is:
['1.1', '1.2', '2.0']
['1.0.1', '1.1.1']
The Annex\s*(\d+(?:(?:\W|and)+\d+)*) regex matches:
Annex - the string Annex
\s* - zero or more whitespaces
(\d+(?:(?:\W|and)+\d+)*) - Group 1: one or more digits, then zero or more repetitions of one or more non-word chars or "and" strings followed by one or more digits.
Then, when the match is found, all dot-separated digit sequences are extracted from Group 1 with \d+(?:\.\d+)*.
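If you need this in more than one place, the two steps could be wrapped in a small helper. This is only a sketch: the function name extract_annex_numbers and the constant names are made up for illustration, and it assumes you only care about the first Annex in each string and want an empty list when there is none.

```python
import re

# Step 1: capture the whole run of numbers that follows "Annex".
ANNEX_RUN = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
# Step 2: match each dot-separated number inside that run.
NUMBER = re.compile(r'\d+(?:\.\d+)*')


def extract_annex_numbers(text):
    """Return the numbers listed after the first 'Annex' in text, or [] if none."""
    match = ANNEX_RUN.search(text)
    if match is None:
        return []
    return NUMBER.findall(match.group(1))


print(extract_annex_numbers('Refer to Annex 1.1, 1.2 and 2.0 containing information etc,'))
print(extract_annex_numbers('Refer to Annex 1.0.1, 1.1.1 containing information etc,'))
print(extract_annex_numbers('No annex reference here'))
```

Compiling both patterns once avoids re-parsing them on every call. Note that the match is case-sensitive, so the lowercase "annex" in the last example is not picked up; pass re.IGNORECASE to the first pattern if that matters for your data.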