Regex extract arbitrary number of subpatterns

I have sentences with the structure "Name has digit1 word1, digit2 word2, ..., and digitN wordN", where the number of "digit word" subpatterns varies from sentence to sentence and is not known in advance. There is an "and" before the last subpattern, e.g. "Alice has 1 apple, 2 bananas, ..., and 6 oranges."

How can I extract these digits and words using regex in Python? I expect the output below:

Name,

Digit	Word
digit1	word1
digit2	word2
...	...
digitN	wordN

I have tried the following:

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'
import re
matches = re.finditer(r'([Aa-z]+) has (\d) ([a-z]+)( and)*', s)
for match in matches:
  print(match.groups())

But this only gives me ('Alice', '1', 'apple', None), missing '2', 'bananas', '3', 'oranges'.

CodePudding user response：

If you wanted to match everything in a single regex, you'd want something like this:

([^\s]+) has (?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+)){1,}

However, Python's built-in re module keeps only the last match of a repeating capturing group, so I haven't found a way to pull every repetition out of the match object.
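
A quick check of what the built-in re module returns for a single pattern of this kind, with a repeated group. This is only a sketch; the names `single_pattern` and `m` are chosen here for illustration:

```python
import re

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'

# One pattern for the whole sentence; the "digit word" group repeats.
single_pattern = r'([^\s]+) has (?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+)){1,}'

m = re.match(single_pattern, s)

# Groups 2 and 3 hold only the values from the final repetition.
print(m.groups())  # ('Alice', '3', 'oranges.')
```

The whole sentence matches, but "1 apple" and "2 bananas" are no longer available through the groups, which is why splitting the work into two steps is easier with re.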

So here's how I'd recommend approaching the problem:

import re

s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'

matches = re.match(r'^([^\s]+)', s)
print(f'Name: {matches.group(0)}')

matches = re.findall(r'(?:(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,]+))', s)

for match in matches:
    print(f'{match[0]} - {match[1]}')

Sample Output

Name: Alice
1 - apple
2 - bananas
3 - oranges.


Regex Explanations

^([^\s]+) - There are a few different ways to write this, but it just grabs everything up to the first whitespace character in the string.

(?:            - Non-capturing group
 (?:,\s+)?     - Optionally allow the string to have a `,` followed by spaces
 (?:and\s+)?   - Optionally allow the string to contain the word `and` followed by spaces
 (\d+)         - Must have a number
 \s+           - Spaces between number and description
 ([^\s,]+)     - Grab the next set of characters and stop when you find a space or comma. This should be the word (e.g. apple)
)

This second regex just ensures you can pull out the various forms in which a "1 apple" item can appear. It will match the following patterns:

1 apple
, 1 apple
, and 1 apple
and 1 apple
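
If it helps, the two steps can be wrapped in one helper. This is a rough sketch, not a definitive implementation: `parse_sentence` and `ITEM_RE` are hypothetical names, and the item pattern here additionally excludes `.` from the word (an assumption that differs from the pattern explained above) so that the sentence-final period is dropped.

```python
import re

# Item pattern: optional ", " and/or "and ", then a number and a word.
# Assumption: '.' is excluded from the word to drop the trailing period.
ITEM_RE = re.compile(r'(?:,\s+)?(?:and\s+)?(\d+)\s+([^\s,.]+)')


def parse_sentence(sentence):
    """Return (name, [(count, word), ...]) for 'Name has N word, ...'."""
    name_match = re.match(r'([^\s]+)\s+has\b', sentence)
    if name_match is None:
        return None, []
    name = name_match.group(1)
    # Only search after "Name has" so the name itself is never scanned.
    found = ITEM_RE.findall(sentence, name_match.end())
    items = [(int(digits), word) for digits, word in found]
    return name, items


name, items = parse_sentence('Alice has 1 apple, 2 bananas, and 3 oranges.')
print(name)   # Alice
print(items)  # [(1, 'apple'), (2, 'bananas'), (3, 'oranges')]
```

Because findall is used for the items, the number of "digit word" pairs does not need to be known in advance.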

As a side note, a parser is better suited to these problems in the long run. Once the sentences vary more, parsing them with a simple regex becomes pretty difficult.
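
To give a feel for the parser direction, here is a very small hand-rolled sketch: tokenize first, then walk the tokens. It is not a full parser or a parser library, and `parse_tokens` is a hypothetical name; it assumes item words are purely alphabetic and that the second token is always "has".

```python
import re


def parse_tokens(sentence):
    """Tokenize into numbers and words, then pair each number with the next word."""
    tokens = re.findall(r'\d+|[A-Za-z]+', sentence)
    if len(tokens) < 2 or tokens[1] != 'has':
        raise ValueError('expected "<Name> has ..."')
    name = tokens[0]
    pairs = []
    i = 2
    while i < len(tokens):
        tok = tokens[i]
        if tok == 'and':
            # Connective before the last item; carries no data.
            i += 1
        elif tok.isdigit() and i + 1 < len(tokens) and tokens[i + 1].isalpha():
            pairs.append((int(tok), tokens[i + 1]))
            i += 2
        else:
            raise ValueError(f'unexpected token {tok!r}')
    return name, pairs


print(parse_tokens('Alice has 1 apple, 2 bananas, and 3 oranges.'))
# ('Alice', [(1, 'apple'), (2, 'bananas'), (3, 'oranges')])
```

The advantage over a single regex is that unexpected input fails loudly at a specific token instead of silently matching less than intended.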

CodePudding user response：

Use the PyPI regex module, which keeps every capture of a repeated group.

See the Python code:

import regex
s = 'Alice has 1 apple, 2 bananas, and 3 oranges.'
matches = regex.finditer(r'(?P<word1>[A-Za-z]+) has(?:(?:\s+|,\s+|,?\s+and\s+)?(?P<number>\d+)\s+(?P<word2>[a-z]+))*', s)
for match in matches:
  print(match.capturesdict())

Results: {'word1': ['Alice'], 'number': ['1', '2', '3'], 'word2': ['apple', 'bananas', 'oranges']}
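
To get from that dictionary to the Name / Digit / Word layout asked for in the question, the two lists can be zipped together. A small sketch in plain Python, assuming a dict shaped exactly like the result above (`to_rows` is a hypothetical helper name, and the dict is written out literally here so the snippet runs without the regex module):

```python
# Assumed input: the dict printed by capturesdict() in the result above.
captures = {
    'word1': ['Alice'],
    'number': ['1', '2', '3'],
    'word2': ['apple', 'bananas', 'oranges'],
}


def to_rows(captures):
    """Pair each captured number with the word captured alongside it."""
    name = captures['word1'][0]
    rows = list(zip(captures['number'], captures['word2']))
    return name, rows


name, rows = to_rows(captures)
print(name)
print('Digit\tWord')
for digit, word in rows:
    print(f'{digit}\t{word}')
```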