I want to extract parts of a text like this using regex patterns:
text= """cakes 10 are good.
cakes 10c are good.
cakes 20 21 22 are good.
cakes 30, 31, 32 are good.
cakes 40a, 40b, 40c are good."""
What I want to achieve is to extract the numerical values (10, 10c, etc.), but with each run of them returned as a whole.
I build up the pattern as follows:
import re
numerical = r"""
[0-9]{1,4}((\.|\-)?[A-Za-z])? # max 4 digits, optionally followed by a letter that may be preceded by . or -
"""
# those could be separated by:
separator = r"""(,\s?)""" # comma optionally followed by a space
# code to get the matches with finditer
pattern2 = fr"""(({numerical}
(\s?|({separator})?))+
|
{numerical} # single case
)"""
refs = re.finditer(pattern2, text, re.VERBOSE)
for element in refs:
    print(element.group())
This gives me the results individually, but I would like to get ONE match per run of numbers.
Expected result:
10
10c
20 21 22
30, 31, 32
40a, 40b, 40c
Note: I need to use finditer because later on I need to access the spans.
EDIT: It would be very convenient to keep this type of pattern composition, because then the separator part and the numerical part could each be changed easily.
With one single regex pattern, the code is very difficult for a later reader to modify.
CodePudding user response:
Try (regex101):
import re
text = """cakes 10 are good.
cakes 10c are good.
cakes 20 21 22 are good.
cakes 30, 31, 32 are good.
cakes 40a, 40b, 40c are good."""
for m in re.finditer(r"(?:\d{1,4}[.-]?[a-zA-Z]?\s*,?\s*)+", text):
    print(m.group())
Prints:
10
10c
20 21 22
30, 31, 32
40a, 40b, 40c
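One thing to be aware of with this pattern: because of the trailing \s*, each match also swallows the whitespace after the last number, so the spans run a little too far. Below is a minimal sketch of one way to trim that, assuming you only need the matched text and an adjusted span. The names pattern and trimmed are mine, not part of the answer above.

```python
import re

text = """cakes 10 are good.
cakes 10c are good.
cakes 20 21 22 are good.
cakes 30, 31, 32 are good.
cakes 40a, 40b, 40c are good."""

pattern = re.compile(r"(?:\d{1,4}[.-]?[a-zA-Z]?\s*,?\s*)+")

trimmed = []
for m in pattern.finditer(text):
    raw = m.group()                  # still contains the trailing whitespace
    stripped = raw.rstrip()          # drop it from the text
    start = m.start()
    end = start + len(stripped)      # shrink the span by the same amount
    trimmed.append((stripped, (start, end)))
    print(repr(raw), "->", repr(stripped), (start, end))
```

Slicing text[start:end] with the adjusted span gives back exactly the trimmed string, which is what you would want if the spans are used later for highlighting or replacement.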
CodePudding user response:
- In your pattern you use a separator
  separator = r"""(,\s?)"""
  where the comma is mandatory, which is not the case in your example data.
  What you can do is make the comma optional: separator = r"""(,?\s?)"""
- In this part
  (\s?|({separator})?))
  there is an alternation that can match either an optional whitespace char or, optionally, the separator. You get the undesired result because \s? is optional and will always match first in the alternation.
  What you can do is switch the order of the alternation: (({separator})?|\s?))
This will give you the expected outcome, see a Python demo.
But if you print the pattern now and remove the spaces, you will see that it has unneeded capture groups and an alternation that can be optimized using a character class. You can also remove the second alternative, {numerical} # single case, because everything in the separator part is optional.
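The effect of the alternation order can be seen in isolation. This is only a small sketch on a made-up sample string, with the patterns reduced to the bare minimum (sample, first and second are illustrative names, not taken from your code):

```python
import re

sample = "30, 31"

# \s? is tried first: it succeeds by matching nothing, so the comma is never consumed
first = re.match(r"\d+(?:\s?|,\s?)", sample)

# the comma alternative is tried first: the comma and the space are consumed
second = re.match(r"\d+(?:,\s?|\s?)", sample)

print(repr(first.group()))   # '30'
print(repr(second.group()))  # '30, '
```

An alternative that can match the empty string never fails, so anything listed after it is only reached if the rest of the pattern forces the engine to backtrack.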
When matching a repeated comma separated part, what you can do is match the numerical part first, and then optionally repeat the separator and then again the numerical part.
This will also prevent matching a trailing space.
The code could look like:
import re
text = """cakes 10 are good.
cakes 10c are good.
cakes 20 21 22 are good.
cakes 30, 31, 32 are good.
cakes 40a, 40b, 40c are good."""
numerical = r"""\b[0-9]{1,4}(?:[.-]?[A-Za-z])?"""
# those could be separated by:
separator = r""",?\s"""
# code to get the matches with finditer
pattern2 = fr"""{numerical}(?:{separator}{numerical})*"""
refs = re.finditer(pattern2, text, re.VERBOSE)
for element in refs:
    print(element.group())
Output
10
10c
20 21 22
30, 31, 32
40a, 40b, 40c
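Since you mentioned needing the spans and wanting to swap the parts easily, here is a rough sketch of how the same composition could be wrapped up. build_pattern is a hypothetical helper of mine, and the semicolon-separated input is invented for illustration. I left out re.VERBOSE here so that a separator may contain a literal space if needed.

```python
import re

text = """cakes 10 are good.
cakes 10c are good.
cakes 20 21 22 are good.
cakes 30, 31, 32 are good.
cakes 40a, 40b, 40c are good."""

numerical = r"\b[0-9]{1,4}(?:[.-]?[A-Za-z])?"

def build_pattern(separator):
    """Hypothetical helper: one numerical, then any number of separator + numerical."""
    return re.compile(fr"{numerical}(?:{separator}{numerical})*")

comma_or_space = build_pattern(r",?\s")
for m in comma_or_space.finditer(text):
    # m.span() is the (start, end) pair of the whole run, without trailing space
    print(m.span(), repr(m.group()))

# Swapping only the separator, on an invented semicolon-separated input
semi = build_pattern(r";\s?")
print([m.group() for m in semi.finditer("cakes 50; 51;52 are good.")])
```

Only the separator argument changes between the two patterns, so a later reader can adjust either part without touching the other.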
The pattern now looks like:
\b[0-9]{1,4}(?:[.-]?[A-Za-z])?(?:,?\s[0-9]{1,4}(?:[.-]?[A-Za-z])?)*
Explanation
- \b[0-9]{1,4}
  A word boundary to prevent a partial word match, followed by 1-4 digits
- (?:
  Non capture group
- [.-]?[A-Za-z]
  Optionally match either a . or - using a character class, followed by a letter
- )?
  Close the non capture group and make it optional
- (?:
  Non capture group
- ,?\s
  Match the delimiter, an optional comma and a whitespace char
- [0-9]{1,4}(?:[.-]?[A-Za-z])?
  The numerical pattern again
- )*
  Close the non capture group and optionally repeat
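If you prefer to keep the explanation next to the pattern, the same breakdown can be written as comments with re.VERBOSE, which you were already using. This is just a sketch of the expanded pattern shown above. The test string, including the 7-x item, is invented to exercise the optional . or - part.

```python
import re

pattern = re.compile(r"""
    \b[0-9]{1,4}            # word boundary, then 1-4 digits
    (?:[.-]?[A-Za-z])?      # optional letter, optionally preceded by . or -
    (?:                     # start of the repeated part
        ,?\s                # delimiter: optional comma, then one whitespace char
        [0-9]{1,4}          # 1-4 digits
        (?:[.-]?[A-Za-z])?  # optional letter, optionally preceded by . or -
    )*                      # close the group and repeat it zero or more times
""", re.VERBOSE)

found = [m.group() for m in pattern.finditer("cakes 40a, 40b, 40c and 7-x are good.")]
print(found)
```

Note that in verbose mode literal whitespace in the pattern is ignored, which is why the delimiter uses \s rather than a plain space.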