Extract Multiple Numbers from a String in Python

Time:01-07

I have the following data in Excel:

Work_Experience
6 Year(s) 1 Month(s)
12 Year(s) 11 Month(s)
10 Year(s) 10 Month(s)
10 Year(s) 2 Month(s)
2 Year(s) 12 Month(s)

Now I want Python to generate two extra columns as output: the number of years (digits) in column B and the number of months (digits) in column C, as shown below:

Work_Experience       Year  Month
6 Year(s) 1 Month(s)     6      1
12 Year(s) 11 Month(s)  12     11
10 Year(s) 10 Month(s)  10     10
10 Year(s) 2 Month(s)   10      2
2 Year(s) 12 Month(s)    2     12

I tried the following code:

Test[['Year','Month']] = Test['Work_Experience'].str.extract(\(\d+)(\d+))

It raises SyntaxError: unexpected character after line continuation character

CodePudding user response:

You can use str.extract:

df[['Year', 'Month']] = (df['Work_Experience']
                         .str.extract(r'(\d+)\s*Year.*?(\d+)\s*Month')
                         .astype(int)
                         )

Output:

          Work_Experience Year Month
0    6 Year(s) 1 Month(s)    6     1
1  12 Year(s) 11 Month(s)   12    11
2  10 Year(s) 10 Month(s)   10    10
3   10 Year(s) 2 Month(s)   10     2
4   2 Year(s) 12 Month(s)    2    12
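
For reference, here is a self-contained sketch of the same approach. It assumes the column is already in a DataFrame (built by hand here instead of being read from Excel), and the variable name pattern is only illustrative.

```python
import pandas as pd

# Sample data typed in by hand; in practice it would be loaded from the Excel file.
df = pd.DataFrame({
    "Work_Experience": [
        "6 Year(s) 1 Month(s)",
        "12 Year(s) 11 Month(s)",
        "10 Year(s) 10 Month(s)",
        "10 Year(s) 2 Month(s)",
        "2 Year(s) 12 Month(s)",
    ]
})

# (\d+) captures one or more digits; the raw string keeps \d and \s intact.
pattern = r"(\d+)\s*Year.*?(\d+)\s*Month"

# str.extract returns one column per capture group (as strings), so cast to int.
df[["Year", "Month"]] = df["Work_Experience"].str.extract(pattern).astype(int)

print(df)
```

Note that .astype(int) would fail if any row did not match the pattern, because str.extract returns NaN for such rows.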

Alternative

If you want an alternative that extracts the number/unit pairs in any order and automatically uses the word following each number as the column name:

df = df.join(df['Work_Experience']
 .str.extractall(r'(\d+)\s*(\w+)')
 .droplevel(1)
 .pivot(columns=1, values=0).astype(int)
)
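
To make the chain above easier to follow, here is a step-by-step sketch on two sample rows, one of which lists the month first. The names matches, wide and result, and the column labels value and unit, are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Work_Experience": ["6 Year(s) 1 Month(s)", "11 Month(s) 12 Year(s)"]
})

# One row per match: the index is (original row, match number),
# the columns are the two capture groups.
matches = df["Work_Experience"].str.extractall(r"(\d+)\s*(\w+)")
matches.columns = ["value", "unit"]

# Drop the match-number level so the index lines up with df again,
# then turn each unit ("Year", "Month") into its own column.
wide = (matches.droplevel("match")
               .pivot(columns="unit", values="value")
               .astype(int))

result = df.join(wide)
print(result)
```

Because the column names come from the text itself, the order of "Year" and "Month" inside each string does not matter.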

CodePudding user response:

Your pattern \(\d+)(\d+) starts by matching a literal opening parenthesis with \(, so the unescaped ) after \d+ closes a group that was never opened. The characters between the two numbers (" Year(s) ") are not matched either.

Note that the regex must be passed to str.extract as a quoted string. Outside a string, Python treats the backslash as a line continuation character, which is why you get that SyntaxError.
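
Both problems can be reproduced with the standard library alone. The sketch below is only a demonstration: it compiles the unquoted expression from a string to trigger the SyntaxError, then compiles the quoted pattern with re to show that it is still invalid as a regex. The variable names are illustrative.

```python
import re

# 1) Unquoted pattern: Python itself rejects the source code.
source = r"Test['Work_Experience'].str.extract(\(\d+)(\d+))"
try:
    compile(source, "<example>", "eval")
    syntax_message = None
except SyntaxError as exc:
    syntax_message = exc.msg
print("SyntaxError:", syntax_message)

# 2) Quoted pattern: valid Python, but the regex has an unmatched ")".
try:
    re.compile(r"\(\d+)(\d+)")
    regex_message = None
except re.error as exc:
    regex_message = str(exc)
print("re.error:", regex_message)
```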

If you want to combine matching the parenthesis and the groupings of the digits:

\b(\d+)\s+Year\(s\)\s+(\d+)\s+Month\(s\)

Explanation

  • \b A word boundary
  • (\d+) Capture 1+ digits in group 1
  • \s+Year\(s\)\s+ Match Year(s) between 1+ whitespace chars
  • (\d+) Capture 1+ digits in group 2
  • \s+Month\(s\) Match 1+ whitespace chars and Month(s)
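
The breakdown above can be checked on a single string with Python's re module. This is a minimal sketch: re.VERBOSE is used only so that each part of the pattern can carry a comment, and the sample string is taken from the question.

```python
import re

pattern = re.compile(r"""
    \b(\d+)           # group 1: one or more digits (the years)
    \s+Year\(s\)\s+   # literal Year(s) with whitespace on both sides
    (\d+)             # group 2: one or more digits (the months)
    \s+Month\(s\)     # whitespace followed by literal Month(s)
""", re.VERBOSE)

match = pattern.search("12 Year(s) 11 Month(s)")
years, months = (int(g) for g in match.groups())
print(years, months)
```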

See a regex101 demo.

Test[['Year', 'Month']] = Test['Work_Experience'].str.extract(r'\b(\d+)\s+Year\(s\)\s+(\d+)\s+Month\(s\)')
print(Test)

Output

          Work_Experience Year Month
0    6 Year(s) 1 Month(s)    6     1
1  12 Year(s) 11 Month(s)   12    11
2  10 Year(s) 10 Month(s)   10    10
3   10 Year(s) 2 Month(s)   10     2
4   2 Year(s) 12 Month(s)    2    12