Python regex: avoid matching only the first occurrence

I have the following test text:

"but two of them were fool saying twenty six nine seven twenty six then the seven thousand twenty three people and all the people saying is three two just one person said one thousand thirty three and three and two and seven three "

I would like to match each number as a whole: two, twenty six, nine, seven, twenty six, seven thousand twenty three, and so on.

I am currently playing with this regex:

However, when I run re.findall(REGEX, text), I can't get seven thousand twenty three. I only get two, twenty six, seven, twenty six, seven thousand, three. So seven thousand is found, but the right answer should be seven thousand twenty three.

EDIT:

I know that my regex contains both seven thousand and seven thousand twenty three. I would like re to take the following words into account as well, so that it captures seven thousand twenty three instead of stopping at seven thousand.

Is there a way to get seven thousand twenty three as a whole?

CodePudding user response：

You can adapt the regex from the "Regex to Match Numbers in Plain English" page on Rexegg.com to Python.

See the Python demo:

import re

one_to_9 = r"(?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)" # end one_to_9 definition
ten_to_19 = r"(?:(?:(?:s(?:even|ix)|f(?:our|if)|nine)te|e(?:ighte|lev))en|t(?:(?:hirte)?en|welve))" # end ten_to_19 definition
two_digit_prefix = r"(?:(?:s(?:even|ix)|t(?:hir|wen)|f(?:if|or)|eigh|nine)ty)" # end two_digit_prefix definition
one_to_99 = fr"(?:{two_digit_prefix}(?:[-\s]{one_to_9})?|{ten_to_19}|{one_to_9})" # end one_to_99 definition
one_to_999 = fr"(?:{one_to_9}\shundred(?:\s(?:and\s)?{one_to_99})?|{one_to_99})" # end one_to_999 definition
one_to_999_999 = fr"(?:{one_to_999}\sthousand(?:\s{one_to_999})?|{one_to_999})" # end one_to_999_999 definition
one_to_999_999_999 = fr"(?:{one_to_999}\smillion(?:\s{one_to_999_999})?|{one_to_999_999})" # end one_to_999_999_999 definition
one_to_999_999_999_999 = fr"(?:{one_to_999}\sbillion(?:\s{one_to_999_999_999})?|{one_to_999_999_999})" # end one_to_999_999_999_999 definition
one_to_999_999_999_999_999 = fr"(?:{one_to_999}\strillion(?:\s{one_to_999_999_999_999})?|{one_to_999_999_999_999})" # end one_to_999_999_999_999_999 definition
bignumber = fr"(?:zero|{one_to_999_999_999_999_999})" # end bignumber definition
zero_to_9 = fr"(?:{one_to_9}|zero)" # end zero to 9 definition
decimals = fr"point(?:\s{zero_to_9})+" # end decimals definition
numeral_pattern = fr"{bignumber}(?:\s{decimals})?"

rx = re.compile(numeral_pattern)
text = "but two of them were fool saying twenty six nine seven twenty six then the seven thousand twenty three people and all the people saying is three two just one person said one thousand thirty three and three and two and seven three"
print(rx.findall(text))

Output:

['two', 'twenty six', 'nine', 'seven', 'twenty six', 'seven thousand twenty three', 'three', 'two', 'one', 'one thousand thirty three', 'three', 'two', 'seven', 'three']
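
One caveat worth checking: the pattern above is compiled without word boundaries, so it can also match number words that appear inside longer words. The following is a minimal sketch of that effect using only the one_to_9 fragment from the code above; the sample sentence is made up for illustration and is not part of the question.

```python
import re

# Same one_to_9 fragment as in the answer above.
one_to_9 = r"(?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)"

# Hypothetical sample text containing number words hidden inside other words.
sample = "someone lifted the weight of two boxes"

# Without boundaries, "one" in "someone" and "eight" in "weight" are matched.
unanchored = re.findall(one_to_9, sample)

# With \b on both sides, only standalone number words are matched.
anchored = re.findall(fr"\b{one_to_9}\b", sample)

print(unanchored)  # ['one', 'eight', 'two']
print(anchored)    # ['two']
```

Wrapping the full pattern the same way, for example fr"\b{numeral_pattern}\b", should give the same protection. For the test text in the question the result would be expected to stay the same, since every match there is already a whole word.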

CodePudding user response：

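The problem is the order of the alternatives in your regex. Alternatives are tried from left to right and the first one that matches wins, so \sseven\sthousand\stwenty\sthree must come before \sseven\sthousand.

The corrected regex below matches seven thousand twenty three: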

\b(one|two|three|seven|\stwenty\ssix|\sseven\sthousand\stwenty\sthree|\sseven\sthousand|\sone\sthousand\sthirty\sthree)\b
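
The effect of alternative order can be checked in isolation. The sketch below uses a shortened text and a hypothetical phrase list (neither is the asker's original regex). It first compares the two orderings, then builds a pattern by sorting the phrases longest first, which is one possible way to avoid reordering by hand.

```python
import re

text = "then the seven thousand twenty three people"

# Shorter alternative first: the engine stops at the first alternative that matches.
short_first = re.findall(r"\b(?:seven thousand|seven thousand twenty three)\b", text)

# Longer alternative first: the full phrase is tried before its prefix.
long_first = re.findall(r"\b(?:seven thousand twenty three|seven thousand)\b", text)

# Hypothetical phrase list. Sorting by length (longest first) ensures that
# no phrase is shadowed by one of its own prefixes.
phrases = ["seven", "seven thousand", "twenty six", "seven thousand twenty three", "three"]
ordered = sorted(phrases, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
sorted_result = pattern.findall(text)

print(short_first)    # ['seven thousand']
print(long_first)     # ['seven thousand twenty three']
print(sorted_result)  # ['seven thousand twenty three']
```

Sorting only helps with a fixed list of phrases. For arbitrary numbers, a compositional pattern is the more general approach.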

Demo

import re

regex = r"\b(one|two|three|seven|\stwenty\ssix|\sseven\sthousand\stwenty\sthree|\sseven\sthousand|\sone\sthousand\sthirty\sthree)\b"

test_str = "\"but two of them were fool saying twenty six nine seven twenty six then the seven thousand twenty three people and all the people saying is three two just one person said one thousand thirty three and three and two and seven three \""

matches = re.finditer(regex, test_str, re.MULTILINE)