I have a dataframe column as follows:
df['col1']
['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']
I need to extract the following:
['asd-pwr', 'asd-pwr2', 'asd-pwr3']
i.e., the last pair of substrings that are connected by a hyphen (-).
I tried the following:
import re
df['col1'].str.extract(r'\s[a-zA-Z]*-[a-zA-Z]*\s', flags=re.IGNORECASE)
However, my regex fails to match any hyphenated pair of substrings at all.
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({'col1': ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']})
>>> df['col1'].str.extract(r'(?:.*\W)?(\w+-\w+)')
0
0 asd-pwr
1 asd-pwr2
2 asd-pwr3
Or, if there can be start of string or whitespace on the left, you may also use
r'(?:.*\s)?(\w+-\w+)'
Details:
(?:.*\W)? - an optional sequence of any zero or more chars other than line break chars, as many as possible, followed by a non-word char (in the second pattern, \s matches a whitespace char instead)
(\w+-\w+) - Group 1: one or more word chars, a hyphen (-), and one or more word chars.
As .* is greedy, the last part of the pattern between round brackets (aka capturing parentheses) gets the last occurrence of hyphenated words.
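To see the effect of the greedy prefix in isolation, here is a minimal sketch using only the standard re module (the names rows, first_pairs and last_pairs are made up for illustration). It assumes the pattern is written with + quantifiers, i.e. \w+-\w+, and compares a search without the prefix against one with it.

```python
import re

rows = ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']

# Without a prefix, the search stops at the first hyphenated pair it finds.
first_pairs = [re.search(r'(\w+-\w+)', s).group(1) for s in rows]

# With the greedy optional prefix, .* first consumes as much as it can and
# then backtracks, so Group 1 ends up on the last hyphenated pair.
last_pairs = [re.search(r'(?:.*\W)?(\w+-\w+)', s).group(1) for s in rows]

print(first_pairs)
print(last_pairs)
```

For rows that contain two pairs, the two lists should differ; for a row with a single pair, both searches return the same substring.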
CodePudding user response:
This regex should do the trick
\w*-\w*(?=(\s|$)\w*.*$)
It matches every hyphenated pair, so take only the last item from the resulting list of matches.
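A rough sketch of the "find all, keep the last" idea, using the standard re module. Note that it swaps in a simpler pattern, \w+-\w+, instead of the regex above: re.findall returns the contents of capturing groups when a pattern has any, so the (\s|$) group would get in the way. The helper name last_hyphenated is made up for this example.

```python
import re

# Simplified pattern (an assumption, not the regex from this answer):
# one or more word chars, a hyphen, one or more word chars.
PAIR = re.compile(r'\w+-\w+')

def last_hyphenated(text):
    """Return the last hyphenated pair in text, or None if there is none."""
    matches = PAIR.findall(text)
    return matches[-1] if matches else None

rows = ['cat-dog asd-pwr sdf', 'cat-goat asd-pwr2 sdf', 'cat asd-pwr3 sdf']
results = [last_hyphenated(s) for s in rows]
print(results)
```

In pandas, the same idea should be expressible as df['col1'].str.findall(r'\w+-\w+').str[-1], which indexes the last element of each list of matches.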
CodePudding user response:
You can use:
import re
df['col1'].str.extract(r'\s*(\w+-\w+)(?!.*-)\s*', flags=re.IGNORECASE)
Here, we use \w instead of [a-zA-Z] because you also want to extract the number after pwr.
We also use a negative lookahead (?!.*-) to ensure the current matching substring is the last substring with a hyphen (-) in the string.
Result:
0
0 asd-pwr
1 asd-pwr2
2 asd-pwr3
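The two points above (\w versus [a-zA-Z], and the negative lookahead) can be checked separately with plain re. This is only an illustrative sketch on one sample string; the variable names are invented for the example, and the patterns assume the + quantifiers shown in the extract call.

```python
import re

text = 'cat-goat asd-pwr2 sdf'

# Letters only: the trailing digit is not part of the match.
letters_only = re.search(r'([a-zA-Z]+-[a-zA-Z]+)(?!.*-)', text).group(1)

# \w also matches digits (and underscores), so the full token is captured.
word_chars = re.search(r'(\w+-\w+)(?!.*-)', text).group(1)

# Without the lookahead, the first pair is returned instead of the last one.
no_lookahead = re.search(r'(\w+-\w+)', text).group(1)

print(letters_only, word_chars, no_lookahead)
```

The lookahead rejects any candidate that still has a hyphen somewhere to its right, which is why only the final pair survives.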