I need to extract dimensions from text, where they can be written in several ways:
- "... 10.5 inches x 5 feet x 2 inches ..."
- "... 10 inches x 5 inches ..."
- "... 10 inches x 5 inches ..."
- "... 10 x 5 inches ..."
There can be two or three dimensions, and each may or may not have its own unit of measurement.
I am struggling to add the list of optional units and to make the third value optional in the regex:
dimensions = re.findall(r'(\d+\.?\d*)\s*inches?feet?\s*x\s*(\d+\.?\d*)\s*inches?feet?\s*x?\s*(\d+\.?\d*)?\s*inches?feet?', string)
CodePudding user response:
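What you have is inches?feet?, where each ? applies only to the single character before it. The pattern therefore says "match 'inche', an optional 's', then 'fee', and an optional 't'". This means it requires both words back to back, as in "5 inchesfeet", and cannot match "inches" or "feet" on its own.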
You were fairly close. The key idea you missed is that | specifies alternatives to match: (?:inches|feet)?. The alternatives are wrapped in a non-capturing group so that the alternation is limited to "inches" or "feet" and does not extend over the rest of the pattern. The ? at the end makes the entire group optional.
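As a quick check of the difference, the sketch below compares the two unit patterns in isolation. The names broken, fixed, and samples are only illustrative and are not part of the original question.

```python
import re

broken = re.compile(r'inches?feet?')      # the question's attempt
fixed = re.compile(r'(?:inches|feet)?')   # alternation in an optional group

samples = ['inches', 'feet', '', 'inchesfeet']

# fullmatch() succeeds only if the whole sample matches the pattern.
broken_results = {s: bool(broken.fullmatch(s)) for s in samples}
fixed_results = {s: bool(fixed.fullmatch(s)) for s in samples}

print(broken_results)
print(fixed_results)
```

Here broken accepts only the run-together "inchesfeet", while fixed accepts either unit or no unit at all.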
To make the entire third dimension optional, put its pattern in a non-capturing group and then make that group optional by appending ? to it:
(?:x\s*(\d+\.?\d*)?\s*(?:inches|feet)?)?
The final regex is
(\d+\.?\d*)\s*(?:inches|feet)?\s*x\s*(\d+\.?\d*)\s*(?:inches|feet)?\s*(?:x\s*(\d+\.?\d*)?\s*(?:inches|feet)?)?
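A minimal sketch of how this regex could be used with re.findall is shown below. The sample strings are adapted from the question, and the variable names are assumptions made for the example.

```python
import re

# The final regex, split over two string literals for readability.
pattern = re.compile(
    r'(\d+\.?\d*)\s*(?:inches|feet)?\s*x\s*(\d+\.?\d*)\s*(?:inches|feet)?'
    r'\s*(?:x\s*(\d+\.?\d*)?\s*(?:inches|feet)?)?'
)

three_dims = pattern.findall('... 10.5 inches x 5 feet x 2 inches ...')
two_dims = pattern.findall('... 10 x 5 inches ...')

print(three_dims)  # [('10.5', '5', '2')]
print(two_dims)    # [('10', '5', '')]
```

With three capturing groups, findall returns one tuple per match, and a missing third dimension comes back as an empty string. Only the numbers are captured; the units sit in non-capturing groups and are discarded.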
CodePudding user response:
Here is one working re.findall approach:
import re

inp = """... 10 inches x 5 feet x 2 inches ...
... 10 inches x 5 inches ...
... 10 inches x 5 inches ...
... 10 x 5 inches ..."""
# A number with an optional unit, followed by zero or more " x <number> [unit]" parts.
dims = re.findall(r'\d+(?:\.\d+)?(?: (?!x\b)\w+)?(?: x \d+(?:\.\d+)?(?: (?!x\b)\w+)?)*', inp)
print(dims)
This prints:
['10 inches x 5 feet x 2 inches',
'10 inches x 5 inches',
'10 inches x 5 inches',
'10 x 5 inches']
Here is an explanation of the regex pattern being used:
\d+(?:\.\d+)?               match a number, with an optional decimal part
(?:
    [ ]                     followed by a space
    (?!x\b)                 NOT followed by the separator 'x'
    \w+                     but followed by any other word (the unit)
)?                          the unit is optional
(?:
    [ ]                     space
    x                       'x'
    [ ]                     space
    \d+(?:\.\d+)?           another number
    (?: (?!x\b)\w+)?        with its own optional unit
)*                          zero or more further dimensions
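The breakdown above maps almost directly onto Python's re.VERBOSE mode, which ignores whitespace in the pattern and allows inline comments. The sketch below is one way to write it; DIMS and the helper split_dims are hypothetical names introduced for illustration, and the helper assumes the parts are separated by exactly " x ", as the pattern requires.

```python
import re

DIMS = re.compile(r"""
    \d+(?:\.\d+)?           # a number, with an optional decimal part
    (?:
        [ ]                 # a single space
        (?!x\b)             # the next word must not be the separator 'x'
        \w+                 # a unit word such as 'inches' or 'feet'
    )?                      # the unit is optional
    (?:
        [ ]x[ ]             # the separator ' x '
        \d+(?:\.\d+)?       # another number
        (?:[ ](?!x\b)\w+)?  # with its own optional unit
    )*                      # repeated for each further dimension
""", re.VERBOSE)


def split_dims(match_text):
    """Split a matched string into (value, unit) pairs; unit is None if absent."""
    pairs = []
    for piece in match_text.split(' x '):
        number, _, unit = piece.partition(' ')
        pairs.append((float(number), unit or None))
    return pairs


inp = "... 10.5 inches x 5 feet x 2 inches ... 10 x 5 inches ..."
matches = DIMS.findall(inp)
parsed = [split_dims(m) for m in matches]
print(matches)
print(parsed)
```

Because the pattern has no capturing groups, findall returns each full match as a string, which the helper then splits into numbers and units.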