I have strings that contain a mix of numeric and non-numeric values, and I want to extract the numeric values from the text. I could not work out how to cover all (or most) of the possible cases. I have a partially working solution like this:
import re

def extract_values(sentence):
    #sentence = normalizeString(sentence)
    matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', sentence)
    # (\d\sto\s\d\s(and\s\d\/\d)*) << for adding 9 to 11, couldn't fix
    result = []
    # x is the whole match, y the first number, z the optional number after "and"
    for x, y, z in matches:
        if '/' in x:
            result.append(x)
        else:
            result.extend(filter(lambda v: v != "", [y, z]))
    return result
Driver code,
extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20")
Incorrect answer:
['1 and 1/2', '5', '5', '9', '11', '9', '9 and 1/2', '11/12', '20']
Expected answer:
['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
Please note the difference between 5 and .5, and 'x to y' and 'x to y and z'
I would appreciate any help. Thank you.
CodePudding user response:
I would do it the following way:
import re
text = "He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"
values = re.findall(r"\d+(?:\s?(?:and|/|to)\s?\d+)*", text)
print(values)
output
['1 and 1/2', '5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
Explanation: I used a non-capturing group here. The pattern matches one or more digits, followed by zero or more repetitions of: and, / or to (each optionally preceded and/or followed by a whitespace character) and then one or more digits. Note that it does not account for a leading decimal point, so .5 is returned as 5.
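To illustrate why the non-capturing group matters here, the following minimal sketch compares the same pattern with a capturing and a non-capturing group. The shortened sample text and the variable names capturing and non_capturing are only for illustration. When a pattern contains a capturing group, re.findall returns the group's content (the last repetition it matched) instead of the whole match.

```python
import re

# Shortened sample text, chosen only for this illustration.
text = "He is between 9 to 11 or 9 to 9 and 1/2."

# Capturing group: findall returns what the group captured last,
# not the full matched text.
capturing = re.findall(r"\d+(\s?(?:and|/|to)\s?\d+)*", text)

# Non-capturing group: findall returns the full matched text.
non_capturing = re.findall(r"\d+(?:\s?(?:and|/|to)\s?\d+)*", text)

print(capturing)      # [' to 11', '/2']
print(non_capturing)  # ['9 to 11', '9 to 9 and 1/2']
```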
CodePudding user response:
You can use
import re
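def extract_values(sentence):
    # A float/int number, optionally followed by more /number parts (1/2, 11/12/20)
    num = r'\d*\.?\d+(?:/\d*\.?\d+)*'
    # Numbers may be chained with "and" or "to" surrounded by whitespace
    return re.findall(fr'{num}(?:\s+(?:and|to)\s+{num})*', sentence)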
print(extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"))
# => ['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
See the Python demo, and the regex demo.
Details:
\d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number
(?:\s+(?:and|to)\s+\d*\.?\d+(?:/\d*\.?\d+)*)* - zero or more occurrences of:
    \s+(?:and|to)\s+ - and or to, enclosed with one or more whitespace characters
    \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number.
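The breakdown above can also be written directly into the pattern using re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is just a sketch of the same pattern in a different layout, not a different approach; the names NUM and PATTERN are made up for this example.

```python
import re

# One number: optional integer part, optional dot, required digits,
# then zero or more "/number" parts (covers 1/2 and 11/12/20).
NUM = r"\d*\.?\d+(?:/\d*\.?\d+)*"

PATTERN = re.compile(
    rf"""
    {NUM}                 # first number, e.g. 9 or .5 or 1/2
    (?:                   # zero or more continuations:
        \s+(?:and|to)\s+  #   'and' or 'to' surrounded by whitespace
        {NUM}             #   followed by another number
    )*
    """,
    re.VERBOSE,
)

def extract_values(sentence):
    # No capturing groups are used, so findall returns the full matches.
    return PATTERN.findall(sentence)

sample = ("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. "
          "He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20")
print(extract_values(sample))
# => ['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
```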