Python Pandas Extract text between words and symbols with regex-CodePudding

I am trying to extract the words that sit between a fixed piece of text and a symbol, using regex in Python pandas. The problem is that sometimes there are non-letter characters after the word I want to extract, and sometimes there is nothing.

Here is the input table.

Here is the expected output.

I've tried str.extract(r'[a-zA-Z\W]+\s+Reason to buy:([a-zA-Z\s]+)[a-zA-Z\W]+\s+'), but it does not work.

Any advice is appreciated.

Thanks.

CodePudding user response：

You just need to capture zero or more characters after Reason to buy:, up to the next <> or the end of the string:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False)

See the regex demo.

Details:

Reason to buy: - a literal string
\s* - zero or more whitespace characters
(.*?) - a capturing group with ID 1 that matches zero or more characters other than line break characters, as few as possible
(?=<>|$) - a positive lookahead that requires a <> string or the end of the string immediately to the right of the current location.
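The breakdown above can be checked with the standard re module before involving pandas. This is a minimal sketch: the two sample strings are taken from the data posted in the other answer in this thread, and the variable names are only illustrative.

```python
import re

# Same pattern as above: literal prefix, optional whitespace,
# lazy capture, then a lookahead for "<>" or end of string.
pattern = re.compile(r"Reason to buy:\s*(.*?)(?=<>|$)")

samples = [
    "Team: Asian<>Reason to buy: leisure",               # nothing after the word
    "Team: SouthAmerica<>Reason to buy: educational<>",  # "<>" after the word
]

# group(1) is the lazy capture; the lookahead itself consumes nothing.
extracted = [pattern.search(s).group(1) for s in samples]
print(extracted)  # ['leisure', 'educational']
```

Because the capture is lazy, it stops at the first position where the lookahead succeeds, which covers both the trailing <> case and the end-of-string case.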

Note that Series.str.extract requires at least one capturing group in the pattern and returns only what that group matched, so it does not matter whether the trailing <> is consumed. You can therefore use a non-capturing group instead of the positive lookahead:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False)
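To see that the two variants should give the same result, here is a small sketch that runs both on a Series. The three rows are copied from the sample data in the other answer; the names s, with_lookahead and with_group are made up for this illustration.

```python
import pandas as pd

s = pd.Series([
    "Team: Asian<>Reason to buy: leisure",
    "Team: SouthAmerica<>Reason to buy: educational<>",
    "Team: Australia<>Reason to buy: commercial",
])

# Variant 1: the lookahead leaves "<>" unconsumed.
with_lookahead = s.str.extract(r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False)

# Variant 2: the non-capturing group consumes "<>", but str.extract
# only returns the contents of capturing group 1.
with_group = s.str.extract(r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False)

print(with_lookahead.tolist())
print(with_group.tolist())
```

With expand=False and a single capturing group, both calls return a Series rather than a one-column DataFrame.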

CodePudding user response：

You can use a lookbehind to extract your reason to buy:

(?<=Reason to buy: )([^<]+)

And use it in your Python code as follows:

import pandas as pd

df = pd.DataFrame([
    [1, 'Team: Asian<>Reason to buy: leisure'             ],
    [2, 'Team: SouthAmerica<>Reason to buy: educational<>'],
    [3, 'Team: Australia<>Reason to buy: commercial'      ],
    [4, 'Team: North America<>Reason to buy: leisure<>'   ],
    [5, 'Team: Europe<>Reason to buy: leisure<>'          ],
], columns = ['team_id', 'reasons'])

# Capture one or more letters that directly follow the literal 'Reason to buy: '
pattern = r'(?<=Reason to buy: )([A-Za-z]+)'

df['reasons'] = df['reasons'].str.extract(pattern)

Output:

   team_id      reasons
0        1      leisure
1        2  educational
2        3   commercial
3        4      leisure
4        5      leisure

Check the Regex demo and the Python demo.
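One difference between the two character classes used in this answer is worth noting: [A-Za-z]+ stops at the first non-letter, while [^<]+ keeps going until the next <. The sketch below uses a hypothetical multi-word reason ("home office") that does not appear in the question's data, purely to show the difference.

```python
import re

# Hypothetical row: a two-word reason, not part of the original data.
text = "Team: Europe<>Reason to buy: home office<>"

# Letters only: the match ends at the space.
letters_only = re.search(r"(?<=Reason to buy: )([A-Za-z]+)", text).group(1)

# Anything except "<": the match runs up to the delimiter.
up_to_bracket = re.search(r"(?<=Reason to buy: )([^<]+)", text).group(1)

print(letters_only)   # home
print(up_to_bracket)  # home office
```

If the reasons are always single words, as in the sample data, both patterns behave the same.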