Extracting first numerical value occurring after some token in text in python

Time:10-31

I have sentences in the following form. I want to extract all numeric values occurring after any given token. For example, I want to extract all numeric values after the phrase "tangible net worth"

Example sentences:

  1. "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"
  2. "Minimum required tangible net worth the firm needs to maintain is $50000000".

From both of these sentences, I want to extract "$100000000" and "$50000000" and create a dictionary like this:

{
    "tangible net worth": "$100000000"
}

I am unsure how to use Python's re module to achieve this. One needs to be careful here, because a significant portion of the sentences contain multiple numeric values, so I only want to extract the first value occurring immediately after the match. I have tried the following expressions, but none of them gives the desired result:

re.search(r'net worth.*(\d+)', sent)
re.search(r'(net worth)(.*)(\d+)', sent)
re.search(r'(net worth)(.*)(\d?)', sent)
re.findall(r'tangible net worth (.*)?(\d* )', sent)
re.findall(r'tangible net worth (.*)?( \d* )', sent)
re.findall(r'tangible net worth (.*)?(\d)', sent)

A little help with the regular expression will be highly appreciated. Thanks.

CodePudding user response:

You could use this regex:

tangible net worth\D*(\d+)

which will skip any non-digit characters after tangible net worth before capturing the first digits that occur after it.

You can then place the result into a dict. Note that I would recommend storing a number rather than a string, as you can always format it on output (adding $, thousands separators, etc.).

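import re

strs = [
    "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5",
    "Minimum required tangible net worth the firm needs to maintain is $50000000"
]

result = []
for sent in strs:
    # \D* skips the non-digit characters after the phrase;
    # (\d+) captures the first run of digits that follows
    m = re.findall(r'tangible net worth\D*(\d+)', sent)
    if m:
        # keep only the first match and store it as a number
        result += [{ 'tangible net worth' : int(m[0]) }]

print(result)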

Output:

[
 {'tangible net worth': 100000000},
 {'tangible net worth': 50000000}
]
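
If the value is stored as an int as recommended above, the $ and the thousands separators can be added back when it is displayed. A minimal sketch using Python's format specification (the variable names here are only illustrative):

```python
# Value as stored in the dict (an int, not a string)
value = 100000000

# ',' in the format spec inserts thousands separators;
# the '$' is just a literal character in the f-string
formatted = f"${value:,}"
print(formatted)  # $100,000,000
```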

CodePudding user response:

You can use:

tangible net worth.*?(\$?\d+)

This searches from "tangible net worth" and then captures the next numeric value (with an optional leading $) in a capturing group. The .*? is non-greedy, so it stops at the first number it can match.


import re

s = """\
A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5
Minimum required tangible net worth the firm needs to maintain is $50000000"""

pat = re.compile(r"tangible net worth.*?(\$?\d+)")

out = [{"tangible net worth": v} for v in pat.findall(s)]
print(out)

Prints:

[
    {'tangible net worth': '$100000000'},
    {'tangible net worth': '$50000000'}
]
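
Since the question asks about "any given token", the same idea can be wrapped in a small helper. This is only a sketch under some assumptions: the function name first_number_after is made up for illustration, the token is escaped with re.escape so it is matched literally, and the pattern is extended with an optional decimal part so that a value like 0.5 is not cut off at the dot. Note that \D also matches newlines, so on multi-line text the search can run past the end of the line containing the token.

```python
import re

def first_number_after(token, text):
    """Return the first number after `token` in `text` as a string, or None."""
    # re.escape makes the token literal; \D* skips non-digits;
    # the group captures digits with an optional decimal part
    pattern = re.escape(token) + r"\D*(\d+(?:\.\d+)?)"
    m = re.search(pattern, text)
    return m.group(1) if m else None

sent = ("A company must maintain a minimum tangible net worth "
        "of $100000000 and leverage ratio of 0.5")

# Build one dictionary entry per token of interest
tokens = ["tangible net worth", "leverage ratio"]
result = {t: first_number_after(t, sent) for t in tokens}
print(result)
```

A token that does not occur in the sentence yields None, so the caller can decide whether to skip it or report it.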