I have sentences of the following form. I want to extract the numeric value that occurs after a given token. For example, I want to extract the numeric value after the phrase "tangible net worth".
Example sentences:
- "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"
- "Minimum required tangible net worth the firm needs to maintain is $50000000".
From these two sentences, I want to extract "$100000000" and "$50000000" and create a dictionary like this:
{
"tangible net worth": "$100000000"
}
I am unsure how to use Python's re module to achieve this. One also needs to be careful here: a significant portion of the sentences contain multiple numeric values, so I want to extract only the value that immediately follows the match. I have tried the following expressions, but none of them gives the desired result:
re.search(r'net worth.*(\d+)', sent)
re.search(r'(net worth)(.*)(\d+)', sent)
re.search(r'(net worth)(.*)(\d?)', sent)
re.findall(r'tangible net worth (.*)?(\d* )', sent)
re.findall(r'tangible net worth (.*)?( \d* )', sent)
re.findall(r'tangible net worth (.*)?(\d)', sent)
A little help with the regular expression will be highly appreciated. Thanks.
CodePudding user response:
You could use this regex:
tangible net worth\D*(\d+)
which will skip any non-digit characters after "tangible net worth" before capturing the first run of digits that occurs after it.
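To see why the attempts in the question did not work, here is a small comparison (a sketch using the first example sentence; the variable names are only for illustration). A greedy `.*` consumes as much as it can and backtracks only as far as needed, so the group ends up with the last digit in the sentence, whereas `\D*` cannot cross a digit at all.

```python
import re

sent = ("A company must maintain a minimum tangible net worth of $100000000 "
        "and leverage ratio of 0.5")

# Greedy `.*` runs to the end of the sentence and backtracks one character
# at a time, so (\d+) is satisfied by the final digit.
greedy = re.search(r"net worth.*(\d+)", sent)
print(greedy.group(1))   # 5

# `\D*` only matches non-digits, so (\d+) starts at the first digit
# that follows the phrase.
bounded = re.search(r"tangible net worth\D*(\d+)", sent)
print(bounded.group(1))  # 100000000
```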
You can then place the result into a dict. Note that I would recommend storing a number rather than a string, as you can always format it on output (adding $, comma thousands separators, etc.).
import re

strs = [
    "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5",
    "Minimum required tangible net worth the firm needs to maintain is $50000000"
]
result = []
for sent in strs:
    # \D* skips non-digits after the phrase; (\d+) captures the first run of digits
    m = re.findall(r'tangible net worth\D*(\d+)', sent)
    if m:
        result += [{'tangible net worth': int(m[0])}]
print(result)
Output:
[
{'tangible net worth': 100000000},
{'tangible net worth': 50000000}
]
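As a brief illustration of formatting on output, the stored integers can be turned back into currency strings with a format spec. This is only a sketch; the `formatted` name and the choice of `$` with comma separators are assumptions for the example.

```python
# Values as stored by the loop above (integers, not strings).
result = [{'tangible net worth': 100000000}, {'tangible net worth': 50000000}]

# The "," format spec inserts thousands separators; "$" is a literal prefix.
formatted = [f"${d['tangible net worth']:,}" for d in result]
print(formatted)  # ['$100,000,000', '$50,000,000']
```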
CodePudding user response:
You can use:
tangible net worth.*?(\$?\d+)
This searches for "tangible net worth" and then captures the next numeric value (with an optional leading $) in a capturing group.
import re
s = """\
A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5
Minimum required tangible net worth the firm needs to maintain is $50000000"""
pat = re.compile(r"tangible net worth.*?(\$?\d+)")
out = [{"tangible net worth": v} for v in pat.findall(s)]
print(out)
Prints:
[
{"tangible net worth": "$100000000"},
{"tangible net worth": "$50000000"}
]
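Since the question asks about any given token, the same idea could be wrapped in a small helper. This is a sketch, not part of either answer: the function name `first_number_after` is hypothetical, `re.escape` is used so that a token containing regex metacharacters is matched literally, and the optional `(?:\.\d+)?` is an added assumption so that decimals such as 0.5 are captured whole.

```python
import re

def first_number_after(token, sentence):
    """Return the first number (optional leading $) after token, or None."""
    # \D*? lazily skips non-digits so that an optional "$" can still be captured.
    pattern = re.escape(token) + r"\D*?(\$?\d+(?:\.\d+)?)"
    m = re.search(pattern, sentence)
    return m.group(1) if m else None

sent = ("A company must maintain a minimum tangible net worth of $100000000 "
        "and leverage ratio of 0.5")
tokens = ["tangible net worth", "leverage ratio"]

found = {t: first_number_after(t, sent) for t in tokens}
print(found)
# {'tangible net worth': '$100000000', 'leverage ratio': '0.5'}
```

Note that this takes the first number that follows the token anywhere in the sentence, so a token with no value of its own would pick up a later, unrelated number.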