I am trying to extract certain words between some text and a symbol using regex in Python Pandas. The problem is that sometimes there are non-alphabetic characters after the word I want to extract, and sometimes there is nothing.
Here is the input table.
Here is the expected output.
I've tried str.extract(r'[a-zA-Z\W]+\s+Reason to buy:([a-zA-Z\s]+)[a-zA-Z\W]+\s+'), but it does not work.
Any advice is appreciated.
Thanks.
CodePudding user response:
You just need to capture zero or more characters up to <> or the end of the string:
df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False)
See the regex demo.
Details:
Reason to buy: - a literal string
\s* - zero or more whitespace characters
(.*?) - a capturing group with ID 1 that matches zero or more characters other than line break characters, as few as possible
(?=<>|$) - a positive lookahead that requires a <> string or the end of the string immediately to the right of the current location.
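To see how these pieces interact, here is a minimal sketch using the standard re module on two sample strings (the strings are illustrative, modeled on the data in this thread). It also shows why the lazy quantifier matters: with a greedy (.*), the end-of-string alternative in the lookahead lets the group swallow the trailing <>.

```python
import re

# Two illustrative inputs: one without and one with a trailing "<>"
samples = [
    "Team: Asian<>Reason to buy: leisure",
    "Team: SouthAmerica<>Reason to buy: educational<>",
]

# Lazy group: stops as soon as "<>" or the end of the string is next
lazy = re.compile(r"Reason to buy:\s*(.*?)(?=<>|$)")
lazy_results = [lazy.search(s).group(1) for s in samples]

# Greedy group: runs to the end of the string, where "$" satisfies the lookahead
greedy = re.compile(r"Reason to buy:\s*(.*)(?=<>|$)")
greedy_results = [greedy.search(s).group(1) for s in samples]

print(lazy_results)    # ['leisure', 'educational']
print(greedy_results)  # ['leisure', 'educational<>']
```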
Note that Series.str.extract requires at least one capturing group in the pattern and returns only what the groups capture, so you can even use a non-capturing group instead of the positive lookahead:
df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False)
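As a rough check that the two variants behave the same, the sketch below runs both on a small DataFrame. The sample rows are assumptions made for illustration, including a hypothetical third row that has no "Reason to buy:" part at all; str.extract returns a missing value for rows where the pattern does not match.

```python
import pandas as pd

df = pd.DataFrame({
    "team_id": [1, 2, 3],
    "reasons": [
        "Team: Asian<>Reason to buy: leisure",
        "Team: SouthAmerica<>Reason to buy: educational<>",
        "Team: Europe",  # hypothetical row with no reason to extract
    ],
})

# Variant 1: the lookahead checks for "<>" or end of string without consuming it
lookahead = df["reasons"].str.extract(
    r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False
)

# Variant 2: the non-capturing group consumes "<>", but only group 1 is returned
non_capturing = df["reasons"].str.extract(
    r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False
)

print(lookahead)
print(lookahead.equals(non_capturing))
```

With expand=False and a single capturing group, the result is a Series, which is why it can be assigned directly to a column.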
CodePudding user response:
You can use a lookbehind to extract your reason to buy:
(?<=Reason to buy: )([^<]+)
And use it in your Python code as follows:
import pandas as pd

# Sample data: some rows end with a trailing "<>", some do not
df = pd.DataFrame([
    [1, 'Team: Asian<>Reason to buy: leisure'],
    [2, 'Team: SouthAmerica<>Reason to buy: educational<>'],
    [3, 'Team: Australia<>Reason to buy: commercial'],
    [4, 'Team: North America<>Reason to buy: leisure<>'],
    [5, 'Team: Europe<>Reason to buy: leisure<>'],
], columns=['team_id', 'reasons'])

# The lookbehind anchors the match right after "Reason to buy: ";
# group 1 captures the run of letters that follows
pattern = r'(?<=Reason to buy: )([A-Za-z]+)'
df['reasons'] = df['reasons'].str.extract(pattern, expand=False)
print(df)
Output:
   team_id      reasons
0        1      leisure
1        2  educational
2        3   commercial
3        4      leisure
4        5      leisure
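The pattern shown on its own uses a "not <" character class, while the code uses a letters-only class. Both give the same result on the sample data, but they differ once a reason contains spaces. The following sketch uses a made-up multi-word reason (an assumption, not taken from the question's data) to show the difference:

```python
import re

# Hypothetical input with a multi-word reason
text = "Team: Asia<>Reason to buy: long term investment<>"

# Letters only: stops at the first space
letters_only = re.search(r"(?<=Reason to buy: )([A-Za-z]+)", text).group(1)

# Anything except "<": runs up to the next "<" or the end of the string
up_to_angle = re.search(r"(?<=Reason to buy: )([^<]+)", text).group(1)

print(letters_only)  # long
print(up_to_angle)   # long term investment
```

If reasons can contain spaces or punctuation, the "not <" class is probably the safer choice; if they are always single words, either works.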