I have a string mixed with numbers and words. I want to be able to extract the numeric values from the string as tokens.
For example, the input
str1 = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."
should ideally output:
Score -> word
1 -> number
and -> word
2 -> number
...
1 and 1/2 -> number (this group should stay together as number)
or -> word
2.5 -> number
...
3 and 1/3 -> number
I could partly solve the problem with the following regexes.
Rule 1:
re.findall(r'\s*(\d*\.?\d+)\s*', str1)
Rule 2:
re.findall(r'(?:\s*\d* and \d+\/\d+\s*)', str1)
Each rule partly works, but I could not put them together to solve the problem. I tried this:
re.findall(r'(?:\s*(\d*\.?\d+)\s*)|(?:\s*\d* and \d+\/\d+\s*)', str1)
Can anyone please help and show how I could put the rules together and get the result?
CodePudding user response:
You can use
import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."
matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', text)
result = []
for x, y, z in matches:
    if '/' in x:
        # A fraction is involved: keep the whole match as one token
        result.append(x)
    else:
        # No fraction: add the non-empty captured numbers separately
        result.extend(filter(lambda s: s != "", [y, z]))
print(result)
# => ['1', '2', '1 and 1/2', '2.5', '3 and 1/3']
See the Python demo. Here is the regex demo.
Details:
- The regex contains three capturing groups: one around the pattern as a whole, and two wrapping the number or fraction patterns.
- Once you get a match, either put the whole match into result if it contains a / char, or otherwise add the two other captures as separate items.
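To make the role of the three groups concrete, here is a minimal sketch that prints what re.findall returns for the sample sentence. The names NUM, pattern and rows are introduced here for illustration only, and the pattern is assumed to be the one above with its + quantifiers written out.

```python
import re

# Number or fraction sub-pattern, reused for Group 2 and Group 3.
NUM = r'\d*\.?\d+(?:/\d*\.?\d+)?'
pattern = rf'(({NUM})(?:\s+and\s+({NUM}))?)'

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

# With three capturing groups, re.findall returns one 3-tuple per match:
# (whole match, first number, number after "and" or "" if absent).
rows = re.findall(pattern, text)
for whole, first, second in rows:
    print(repr(whole), repr(first), repr(second))
# '1 and 2' '1' '2'
# '1 and 1/2' '1' '1/2'
# '2.5' '2.5' ''
# '3 and 1/3' '3' '1/3'
```

The first row shows why the post-processing loop is needed: "1 and 2" is matched as a single unit, and only the absence of a / tells you it should be split back into two numbers.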
The regex matches:
- ( - outer capturing group start (Group 1):
  - (\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 2: a number/fraction pattern: zero or more digits, an optional . char, one or more digits, and then an optional occurrence of a / char followed by zero or more digits, an optional . char, and one or more digits
  - (?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))? - an optional occurrence of:
    - \s+and\s+ - the word and with one or more whitespaces around it
    - (\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 3: number/fraction pattern
- ) - outer capturing group end.
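If the word/number labels from the question are also needed, one possible alternative (not the approach above) is a single alternation that tries the mixed-number form first. This is only a sketch: TOKEN and tokenize are hypothetical names, and it assumes a mixed number is always an integer, the word and, and a fraction, and that words consist of ASCII letters.

```python
import re

# Alternation order matters: the mixed-number branch must come first,
# otherwise "1 and 1/2" would be consumed piece by piece.
TOKEN = re.compile(r'''
    (?P<number>
        \d+\s+and\s+\d+/\d+     # mixed number such as "1 and 1/2"
      | \d*\.?\d+(?:/\d+)?      # integer, decimal, or bare fraction
    )
  | (?P<word>[A-Za-z]+)         # a run of ASCII letters (assumption)
''', re.VERBOSE)

def tokenize(text):
    """Return (token, label) pairs; label is 'number' or 'word'."""
    return [(m.group(), m.lastgroup) for m in TOKEN.finditer(text)]

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."
for token, label in tokenize(text):
    print(f"{token} -> {label}")
```

Here "1 and 2" falls through to the plain-number branch because the mixed-number branch requires a / after the second number, so no post-processing loop is needed. Punctuation is skipped because neither branch matches it.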