Applying regex to each row of a pandas DataFrame to remove all characters before a specific word

I have a column that I am trying to clean by removing all words before a specific word.

import pandas as pd

data = ['The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short']
df = pd.DataFrame(data, columns=['Text'])

I would like to remove all the words before "interesting" in each row of the column "Text".

I found that this can be done with a regular expression, and it does exactly what I want when applied to a single row (as a string), but I can't figure out how to apply it to each row of the column.

Below is the code that I found to clean a row:

import re

date_div = "The text is interesting but short"

up_to_word = "is"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())

How can I apply it to each row of the column?

CodePudding user response：

We can try using str.replace here:

df["Text"] = df["Text"].str.replace(r'.*? (?=interesting\b)', '', regex=True)

Here is a regex demo showing that the logic is working.
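
As a small sketch of how the lookahead behaves, the snippet below applies the same pattern to a column with varied rows. The sample rows and the name `cleaned` are made up for illustration and are not from the question. Rows that do not contain the word are left unchanged because nothing matches.

```python
import pandas as pd

# Hypothetical sample rows: the last one does not contain the target word.
df = pd.DataFrame({"Text": [
    "The text is interesting but short",
    "A very interesting idea",
    "Nothing to see here",
]})

# The lookahead (?=interesting\b) asserts that the word follows the space
# without consuming it, so the word itself survives the replacement.
pattern = r".*? (?=interesting\b)"
cleaned = df["Text"].str.replace(pattern, "", regex=True)

print(cleaned.tolist())
```

Because the target word is only asserted and never consumed, no .str.strip() call is needed with this pattern.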

CodePudding user response：

You can use

rx_to_first = r'(?s)^.*?\b{}\b'.format(re.escape(up_to_word))
df['Text'] = df['Text'].str.replace(rx_to_first, '', regex=True).str.strip()

Note that re.DOTALL can be set inside the pattern itself: the inline flag (?s) makes every . in the pattern match any character, including line breaks.
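
To illustrate the point about (?s), here is a minimal sketch using only the re module. The two-line input string and the variable names are assumptions chosen for the example. It compares no flag, the flags=re.DOTALL argument, and the inline (?s) form.

```python
import re

# Hypothetical two-line input: the target word sits on the second line.
text = "The text\nis interesting but short"
word = "is"

flag_pattern = r"^.*?\b{}\b".format(re.escape(word))
inline_pattern = r"(?s)^.*?\b{}\b".format(re.escape(word))

# Without DOTALL, "." cannot cross the line break, so nothing matches.
no_flag = re.sub(flag_pattern, "", text)
# Passing the flag and embedding (?s) in the pattern behave the same way.
with_flag = re.sub(flag_pattern, "", text, flags=re.DOTALL)
with_inline = re.sub(inline_pattern, "", text)

print(repr(no_flag))
print(repr(with_flag))
print(repr(with_inline))
```

The inline form is convenient here because the flag travels with the pattern string, so nothing extra has to be passed to the replace call.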

df['Text'].str.replace(rx_to_first, '', regex=True) replaces the match in each row, and .str.strip() removes leading and trailing whitespace from the result.

The \b word boundaries around up_to_word make sure that "is" is matched only as a whole word, not as part of a longer word.
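
The effect of the word boundaries can be seen in a short sketch. The input string below is a hypothetical variation of the question's data in which "is" also appears inside "This".

```python
import re

# Hypothetical input where "is" also occurs inside another word ("This").
text = "This text is interesting but short"
word = "is"

loose = r"(?s)^.*?{}".format(re.escape(word))
strict = r"(?s)^.*?\b{}\b".format(re.escape(word))

# Without \b the lazy match stops at the "is" inside "This".
loose_result = re.sub(loose, "", text).strip()
# With \b it continues to the standalone word "is".
strict_result = re.sub(strict, "", text).strip()

print(loose_result)
print(strict_result)
```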

Here is a complete pandas test:

import pandas as pd
import re

data = ['The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short', 'The text is interesting but short']
df = pd.DataFrame(data, columns=['Text'])
up_to_word = "is"
rx_to_first = r'(?s)^.*?\b{}\b'.format(re.escape(up_to_word))
df['Text'] = df['Text'].str.replace(rx_to_first, '', regex=True).str.strip()

Output:

>>> df
                    Text
0  interesting but short
1  interesting but short
2  interesting but short
3  interesting but short
4  interesting but short
5  interesting but short