Regex pattern for extracting substring from dataframe

I have a dataframe column as follows:

df['col1']

['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']

I need to extract the following:

['asd-pwr', 'asd-pwr2', 'asd-pwr3']

i.e., the last pair of substrings that are connected by -

I tried the following:

import re
df['col1'].str.extract(r'\s[a-zA-Z]*-[a-zA-Z]*\s', flags=re.IGNORECASE)

First of all, my regex does not even match any pair of substrings as desired.

CodePudding user response：

You can use

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']})
>>> df['col1'].str.extract(r'(?:.*\W)?(\w+-\w+)')
          0
0   asd-pwr
1  asd-pwr2
2  asd-pwr3

Or, if there can be start of string or whitespace on the left, you may also use

r'(?:.*\s)?(\w+-\w+)'

Details:

(?:.*\W)? - an optional sequence of zero or more chars other than line break chars, as many as possible, followed by a non-word char (in the second pattern, \s matches a whitespace char instead)
(\w+-\w+) - Group 1: one or more word chars, -, and one or more word chars.

As .* is greedy, the last part of the pattern between round brackets (aka capturing parentheses) gets the last occurrence of hyphenated words.
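
To make the effect of the greedy prefix concrete, here is a minimal sketch using only the standard re module (no pandas), on the assumption that str.extract takes the first match per string in the same way re.search does. The variable names are purely illustrative.

```python
import re

# Sample strings from the question.
samples = ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']

# With the greedy optional prefix, .* first consumes as much as it can and
# then backtracks, so Group 1 ends up on the LAST hyphenated pair.
with_prefix = re.compile(r'(?:.*\W)?(\w+-\w+)')

# Without the prefix, the search stops at the FIRST hyphenated pair.
without_prefix = re.compile(r'(\w+-\w+)')

last_pairs = [with_prefix.search(s).group(1) for s in samples]
first_pairs = [without_prefix.search(s).group(1) for s in samples]

print(last_pairs)
print(first_pairs)
```

The two lists differ only for the rows that contain more than one hyphenated pair, which is exactly where the greedy prefix matters.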

CodePudding user response：

This regex should do the trick

\w*-\w*(?=(\s|$)\w*.*$)

This matches every hyphenated pair in the string, so take only the last item from the resulting list of matches.
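
A rough sketch of the "take the last match" step, using plain re rather than pandas. Note that this pattern contains a capturing group, (\s|$), so re.findall would return that group's text instead of the full match; the sketch therefore uses re.finditer and group(0). The helper name last_hyphenated is hypothetical.

```python
import re

# Pattern from this answer, unchanged.
pattern = re.compile(r'\w*-\w*(?=(\s|$)\w*.*$)')

def last_hyphenated(text):
    """Return the last full match of the pattern in text, or None if there is none."""
    # group(0) is the whole match; the capturing group inside the lookahead is ignored.
    matches = [m.group(0) for m in pattern.finditer(text)]
    return matches[-1] if matches else None

samples = ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']
results = [last_hyphenated(s) for s in samples]
print(results)
```

On a DataFrame column, the same function could be applied per row, for example with df['col1'].map(last_hyphenated).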

CodePudding user response：

You can use:

import re

df['col1'].str.extract(r'\s*(\w+-\w+)(?!.*-)\s*', flags=re.IGNORECASE)

Here, we use \w instead of [a-zA-Z] because you also want to extract the number after pwr.

We also use a negative lookahead (?!.*-) to ensure that the current match is the last hyphenated substring in the string, i.e. no further - appears after it.

Result:

          0
0   asd-pwr
1  asd-pwr2
2  asd-pwr3
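
As a quick check of the negative lookahead outside pandas, the sketch below runs the same idea through re.search, with \w+ meaning one or more word chars. The helper name extract_last_pair is an assumption made for illustration; a string without any hyphen yields None here, which would presumably surface as NaN in the str.extract result.

```python
import re

# Hyphenated pair that is NOT followed by another hyphen later in the string.
pattern = re.compile(r'\s*(\w+-\w+)(?!.*-)\s*')

def extract_last_pair(text):
    """Return Group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

samples = ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']
results = [extract_last_pair(s) for s in samples]
no_hyphen = extract_last_pair('cat dog')

print(results)
print(no_hyphen)
```

For 'cat-dog asd-pwr sdf', the candidate cat-dog is rejected because a - still follows it, so the search moves on until asd-pwr satisfies the lookahead.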