I'm scraping data from a website that I want to analyze. In the section about job experience, I extract text that specifies how long someone has worked at a company. The information looks like this:
Employment period \n2years 2 mon.
It would be easier to analyze the period of employment expressed in months. Now I wonder how to extract this information from the text and calculate it properly. For the given example, the calculation should be:
2 x 12 + 2 = 26
I tried to do it this way:
def text_format(text: str):
    digits = []
    text = text.replace('\\n', ' ')
    text = text.replace('.', '')
    text = text.split()
    for word in text:
        if word.isalpha():
            pass
        else:
            word = int(word)
            digits.append(word)
    total = digits[0] * 12 + digits[1]
    return total
In this particular case the function above works, but the text can also come in other forms, e.g.
Employment period \n3years
OR
Employment period 11 mon.
I have no idea how to handle all possible scenarios.
CodePudding user response:
>>> str_1 = "Employment period \n2years 2 mon."
>>> str_2 = "Employment period \n3years"
>>> str_3 = "Employment period 11 mon."
>>> def func(x):
...     return (
...         eval(x.strip("Employment period ")
...              .strip()
...              .replace("years", "* 12 + ")
...              .replace("mon.", "")
...              .strip()
...              .rstrip(" +")
...         ))
>>> func(str_1)
26
>>> func(str_2)
36
>>> func(str_3)
11
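If you would rather not pass scraped text to eval, the same arithmetic can be done by splitting the text into number/unit pairs. This is only a minimal sketch: the name months_without_eval is made up for illustration, and it assumes the units are always spelled "years" and "mon." and that the "\n" in the scraped text is a real newline character.

```python
def months_without_eval(text: str) -> int:
    """Hypothetical helper: same result as func(), but without eval()."""
    # Drop the label, then pad the unit words with spaces so that
    # "2years" splits into the two tokens "2" and "years".
    body = text.replace("Employment period", "")
    body = body.replace("years", " years ").replace("mon.", " mon ")
    tokens = body.split()

    total = 0
    # Tokens alternate between a number and its unit, e.g. ["2", "years", "2", "mon"].
    for number, unit in zip(tokens[::2], tokens[1::2]):
        total += int(number) * (12 if unit.startswith("year") else 1)
    return total


print(months_without_eval("Employment period \n2years 2 mon."))
print(months_without_eval("Employment period \n3years"))
print(months_without_eval("Employment period 11 mon."))
```

Anything that does not follow the number/unit pattern raises a ValueError from int(), which is easier to notice than eval quietly evaluating unexpected input.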
CodePudding user response:
You can use a regex to tackle this problem and cover the scenarios you listed. For example, one like the following should make the task much easier:
Employment period(?:[ \\n]+(\d+)[ ]*years?)?(?:[ \\n]+(\d+)[ ]*mon\.)?
You can try it out here on the Regex Demo as well.
Here's a Python example that runs through the specific use cases mentioned, along with some additional edge cases that I added:
import re
pattern = re.compile(r'Employment period(?:[ \\n]+(\d+)[ ]*years?)?(?:[ \\n]+(\d+)[ ]*mon\.)?')
string = r"""\
Employment period \n2years 2 mon.
Employment period \n3years
Employment period 11 mon.
Employment period 010 years
Employment period 1 year
Employment period
testing\
"""
for x in pattern.finditer(string):
    print('Found a match:', x.group(0))
    years, months = x.groups()
    if years or months:
        total_months = int(years or 0) * 12 + int(months or 0)
        print(f'Total months: {total_months}')
Output:
Found a match: Employment period \n2years 2 mon.
Total months: 26
Found a match: Employment period \n3years
Total months: 36
Found a match: Employment period 11 mon.
Total months: 11
Found a match: Employment period 010 years
Total months: 120
Found a match: Employment period 1 year
Total months: 12
Found a match: Employment period
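If you need this per scraped record rather than over one big string, the same idea can be wrapped in a small function. This is a sketch under a few assumptions: the names PERIOD_RE and employment_months are made up for illustration, the character class is widened with \s so that a real newline matches as well as a literal backslash followed by n, and None is returned when no duration is present.

```python
import re
from typing import Optional

# Same shape as the pattern above; [\s\\n] accepts whitespace (including a
# real newline) as well as the two literal characters "\" and "n".
PERIOD_RE = re.compile(
    r"Employment period(?:[\s\\n]+(\d+) *years?)?(?:[\s\\n]+(\d+) *mon\.)?"
)


def employment_months(text: str) -> Optional[int]:
    """Hypothetical helper: total months, or None if no duration was found."""
    match = PERIOD_RE.search(text)
    if match is None:
        return None
    years, months = match.groups()
    if years is None and months is None:
        # The label matched, but neither years nor months followed it.
        return None
    return int(years or 0) * 12 + int(months or 0)


print(employment_months("Employment period \n2years 2 mon."))
print(employment_months("Employment period 11 mon."))
print(employment_months("Employment period"))
```

Returning None instead of 0 keeps "no duration given" distinguishable from a genuine zero when you aggregate the results later.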