How to extract all numeric like values from string?

Time:12-24

I have strings that contain a mix of numeric and non-numeric values, and I want to extract the numeric-like values from the text. I could not work out how to cover all (or most) of the possible cases. I have a partially working solution like this:

import re

def extract_values(sentence):
    #sentence = normalizeString(sentence)
    matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', sentence)
    # (\d\sto\s\d\s(and\s\d\/\d)*) << for adding 9 to 11, couldn't fix

    result = []
    for x,y,z in matches:
        if '/' in x:
            result.append(x)
        else:
            result.extend(filter(lambda x: x!="", [y,z]))
    return result

Driver code,

extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20")

Incorrect answer:

['1 and 1/2', '5', '5', '9', '11', '9', '9 and 1/2', '11/12', '20']

Expected answer:

['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']

Please note the difference between 5 and .5, and 'x to y' and 'x to y and z'

I would appreciate any help. Thank you.

CodePudding user response:

I would do it the following way:

import re
text = "He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"
values = re.findall(r"\d+(?:\s?(?:and|/|to)\s?\d+)*",text)
print(values)

output

['1 and 1/2', '5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']

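Explanation: the pattern uses only non-capturing groups, so re.findall returns the whole match each time. It matches one or more digits, followed by zero or more repetitions of: a connector (and, / or to), optionally with a single whitespace character on each side, and then one or more digits. Note that the pattern has no provision for a decimal point, so .5 is returned as 5.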
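The same pattern can be spelled out with re.VERBOSE so that each part carries its own comment. This is only a sketch of the pattern above, not a different approach; the name VALUE_RE is made up for illustration.

```python
import re

# VALUE_RE is a hypothetical name; the pattern is the one from this answer,
# written in verbose mode so whitespace and comments inside it are ignored.
VALUE_RE = re.compile(r"""
    \d+                  # one or more digits
    (?:                  # non-capturing group, repeated zero or more times:
        \s?              #   an optional single whitespace character
        (?:and|/|to)     #   a connector: "and", "/" or "to"
        \s?              #   an optional single whitespace character
        \d+              #   one or more digits
    )*
""", re.VERBOSE)

text = "He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"
print(VALUE_RE.findall(text))
# ['1 and 1/2', '5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
```

Because there are no capturing groups, findall returns plain strings rather than tuples.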

CodePudding user response:

You can use

import re

def extract_values(sentence):
   num = r'\d*\.?\d+(?:/\d*\.?\d+)*'
   return re.findall(fr'{num}(?:\s+(?:and|to)\s+{num})*', sentence)

print(extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"))
# => ['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']

See the Python demo, and the regex demo.

Details:

  • \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number
  • (?:\s+(?:and|to)\s+\d*\.?\d+(?:/\d*\.?\d+)*)* - zero or more occurrences of
    • \s+(?:and|to)\s+ - and or to enclosed with one or more whitespaces
    • \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number.
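The breakdown above can also be sketched by building the pattern from named pieces, which may make it easier to adjust one part at a time. The constant names (NUMBER, FRACTION, JOINER, VALUE) and the two short test sentences are assumptions added for illustration; the resulting regex is the same as the one in the answer.

```python
import re

# Hypothetical names for the pieces described in the details list.
NUMBER = r"\d*\.?\d+"                     # int or float, e.g. 5, .5, 3.14
FRACTION = fr"{NUMBER}(?:/{NUMBER})*"     # a number, then zero or more "/number"
JOINER = r"\s+(?:and|to)\s+"              # "and" or "to" with whitespace around it
VALUE = fr"{FRACTION}(?:{JOINER}{FRACTION})*"

def extract_values(sentence):
    # No capturing groups, so findall returns the full matched strings.
    return re.findall(VALUE, sentence)

print(extract_values("Pi is about 3.14"))      # ['3.14']
print(extract_values("Mix 1/2 to 3/4 cup"))    # ['1/2 to 3/4']
```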