How to separate numeric values from a string using regex in Python?

I have a string mixed with numbers and words. I want to be able to extract the numeric values from the string as tokens.

For example, the input

str1 = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

should ideally produce the following output:
Score -> word
1 -> number 
and -> word
2 -> number 
...
1 and 1/2 -> number (this group should stay together as number)
or -> word
2.5 -> number
...
3 and 1/3 -> number

I could solve the problem partly by using regex as follows,

rule 1:
re.findall(r'\s*(\d*\.?\d+)\s*', str1)

rule 2:
re.findall(r'(?:\s*\d* and \d+\/\d+\s*)', str1)

Each rule partly works on its own, but I could not put them together to solve the problem. I tried this:

re.findall(r'(?:\s*(\d*\.?\d+)\s*)|(?:\s*\d* and \d+\/\d+\s*)', str1)

Can anyone please help and show how I could put the rules together and get the result?

CodePudding user response：

You can use

import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', text)

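result = []
for whole, first, second in matches:
    if '/' in whole:
        # A fraction is involved, so keep the whole match together, e.g. '1 and 1/2'
        result.append(whole)
    else:
        # Otherwise add the captured numbers separately, skipping an empty Group 3
        result.extend(part for part in (first, second) if part)

print(result)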
# => ['1', '2', '1 and 1/2', '2.5', '3 and 1/3']


Details:

The regex contains three capturing groups: one around the pattern as a whole, and two that each wrap a number or fraction pattern.
For each match, if the whole match contains a / char, put it into the result as a single item; otherwise add the two inner captures as separate items, skipping any that are empty.
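
To see why this post-processing step is needed, it can help to look at the raw tuples that re.findall returns when a pattern has several capturing groups. The following is a small sketch using the same sample text; the names NUM, pattern and raw are introduced here only for illustration, and NUM is assumed to be equivalent to the number/fraction sub-pattern used in the answer.

```python
import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

# NUM is a helper name assumed for this sketch: a number or a fraction.
NUM = r'\d*\.?\d+(?:/\d*\.?\d+)?'

# Group 1 wraps the whole match, Groups 2 and 3 wrap the two numbers.
pattern = rf'(({NUM})(?:\s+and\s+({NUM}))?)'

# With three capturing groups, findall returns one 3-tuple per match.
raw = re.findall(pattern, text)
for whole, first, second in raw:
    print((whole, first, second))
# ('1 and 2', '1', '2')
# ('1 and 1/2', '1', '1/2')
# ('2.5', '2.5', '')
# ('3 and 1/3', '3', '1/3')
```

Note that '1 and 2' is captured as one match even though it should be two numbers, which is why the loop checks for a / before deciding whether to keep the match whole or split it.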

The regex matches:

( - start of the outer capturing group (Group 1)
(\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 2: a number/fraction pattern: zero or more digits, an optional ., one or more digits, and then optionally a / char followed by zero or more digits, an optional ., and one or more digits
(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))? - an optional occurrence of
- \s+and\s+ - the word and with one or more whitespace characters on each side
- (\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 3: a number/fraction pattern
) - end of the outer capturing group.
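
If the word/number labels from the question are wanted as well, one possible variation is a single alternation that tries the "N and A/B" form first, then a plain number or fraction, then a word. Order matters here: if the plain-number alternative came first, it would consume the leading 1 of "1 and 1/2" before the longer form could be tried. This is only a sketch under some assumptions: TOKEN_RE and tokenize are hypothetical names, words are taken to be runs of ASCII letters, and punctuation is simply skipped.

```python
import re

# Alternatives are tried left to right, so the mixed-number form comes first.
TOKEN_RE = re.compile(
    r'(?P<number>\d+\s+and\s+\d+/\d+'      # e.g. "1 and 1/2"
    r'|\d*\.?\d+(?:/\d*\.?\d+)?)'          # e.g. "2", "2.5", "1/3"
    r'|(?P<word>[A-Za-z]+)'                # assumed definition of a word
)

def tokenize(text):
    """Return (token, kind) pairs, where kind is 'number' or 'word'."""
    # m.lastgroup is the name of the named group that matched.
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."
for token, kind in tokenize(text):
    print(f"{token} -> {kind}")
```

Because "1 and 2" has no / after the second number, the first alternative fails there and the three tokens 1, and, 2 are produced separately, while "1 and 1/2" and "3 and 1/3" stay together as single number tokens.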