How to extract from text digits and calculate them in a specific order?

Time:11-01

I'm scraping data from a website in order to analyze it. In the job experience section, I extract text that specifies how long someone has worked at the company. The information looks like this:

Employment period \n2years 2 mon.

It would be easier to analyze the employment period if it were expressed in months. I wonder how to extract this information from the text and calculate it properly. For the example above, the calculation should be:

2 x 12 + 2

I tried to do it this way:

def text_format(text: str):
    digits = []
    text = text.replace('\\n', ' ')
    text = text.replace('.', '')
    text = text.split()
    for word in text:
        if word.isalpha():
            pass
        else:
            word = int(word)
            digits.append(word)
    
    total = digits[0] * 12 + digits[1]
    
    return total

In this particular case, the function above works well, but I may have other situations, e.g.

Employment period \n3years

OR

Employment period 11 mon.

I have no idea how to handle all possible scenarios.

CodePudding user response:

>>> str_1 = "Employment period \n2years 2 mon."
>>> str_2 = "Employment period \n3years"
>>> str_3 = "Employment period 11 mon."

>>> def func(x):
...    return (
...        eval(x.strip("Employment period ")
...             .strip()
...             .replace("years", "* 12 + ")
...             .replace("mon.", "")
...             .strip()
...             .rstrip(" +")
...            ))


>>> func(str_1)
26

>>> func(str_2)
36

>>> func(str_3)
11
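
Since `eval` executes whatever text it is given, it may be worth avoiding on scraped input. The following is only a minimal sketch of the same calculation without `eval`, assuming the only units that ever appear are `year`/`years` and `mon.`; the name `months_without_eval` is hypothetical and not part of the answer above.

```python
import re


def months_without_eval(text: str) -> int:
    """Sum every '<n>years' and '<n> mon.' part of text into months."""
    total = 0
    # Each match is a (number, unit) pair such as ('2', 'year') or ('11', 'mon').
    # Assumption: no other words starting with "year" or "mon" follow a number.
    for number, unit in re.findall(r"(\d+)\s*(year|mon)", text):
        total += int(number) * (12 if unit == "year" else 1)
    return total


print(months_without_eval("Employment period \n2years 2 mon."))  # 26
```

A string with no recognizable period yields 0 here, so the caller cannot tell "no data" apart from "zero months"; returning `None` in that case would be a reasonable alternative.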

CodePudding user response:

You can use a regex to tackle this problem and cover all of these scenarios. For example, one like the following should make the task much easier:

Employment period(?:[ \\n]+(\d+)[ ]*years?)?(?:[ \\n]+(\d+)[ ]*mon\.)?

You can try it out here on the Regex Demo as well.

Here's a Python example that runs through the specific use cases mentioned, along with some additional edge cases that I added:

import re

pattern = re.compile(r'Employment period(?:[ \\n]+(\d+)[ ]*years?)?(?:[ \\n]+(\d+)[ ]*mon\.)?')

string = r"""\
Employment period \n2years 2 mon.
Employment period \n3years
Employment period 11 mon.
Employment period 010 years
Employment period 1 year
Employment period
testing\
"""

for x in pattern.finditer(string):
    print('Found a match:', x.group(0))
    years, months = x.groups()
    if years or months:
        total_months = int(years or 0) * 12 + int(months or 0)
        print(f'Total months: {total_months}')

Output:

Found a match: Employment period \n2years 2 mon.
Total months: 26
Found a match: Employment period \n3years
Total months: 36
Found a match: Employment period 11 mon.
Total months: 11
Found a match: Employment period 010 years
Total months: 120
Found a match: Employment period 1 year
Total months: 12
Found a match: Employment period
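
For use on one scraped string at a time, the same idea could be wrapped in a small helper. This is a sketch under assumptions, not the answer's exact pattern: `PERIOD_RE` and `employment_months` are hypothetical names, and the separator is widened to `(?:\s|\\n)+` so that it accepts either a real newline character or the two literal characters `\n`, since the scraped text could contain either.

```python
import re
from typing import Optional

# Assumed variant of the pattern: (?:\s|\\n)+ matches real whitespace
# (including a newline) as well as a literal backslash followed by "n".
PERIOD_RE = re.compile(
    r"Employment period(?:(?:\s|\\n)+(\d+) *years?)?(?:(?:\s|\\n)+(\d+) *mon\.)?"
)


def employment_months(text: str) -> Optional[int]:
    """Return the employment period in months, or None if none is found."""
    match = PERIOD_RE.search(text)
    if match is None:
        return None
    years, months = match.groups()
    if years is None and months is None:
        # The label was present but no years or months followed it.
        return None
    return int(years or 0) * 12 + int(months or 0)


for sample in ("Employment period \n2years 2 mon.",
               "Employment period \n3years",
               "Employment period 11 mon."):
    print(employment_months(sample))
```

Returning `None` rather than 0 keeps a missing period distinguishable from a genuine zero when the values are analyzed later.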