I want to format a text column in a dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring that starts right after the last ".", "?" or "!" and ends at that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
After formatting, it should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
CodePudding user response:
Let's do it with regex to have more problems:
df.text = df.text.str.replace(r"(?<=[.!?]).*?:\s*$", "", regex=True)
Now df.text.tolist() returns:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
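The lookbehind (?<=[.!?]) ftw: it keeps the closing punctuation in place. Note that it is fixed-width (a single character); Python's re module does not support variable-width lookbehind.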
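One caveat with the pattern above: because `.*?` can cross sentence boundaries, a text with several sentences before the final colon loses everything after the first ".", "!" or "?", not just the last sentence. A minimal sketch of a stricter variant follows, written with plain `re` so it runs without pandas. The helper name `drop_trailing_colon_sentence` and the constant `PATTERN` are made up for illustration.

```python
import re

# Match from just after the LAST sentence terminator up to a trailing colon.
# [^.!?]* cannot cross another terminator, so only the final sentence goes.
PATTERN = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")


def drop_trailing_colon_sentence(text: str) -> str:
    """Remove the last sentence if the text ends with a colon (hypothetical helper)."""
    return PATTERN.sub("", text)


print(drop_trailing_colon_sentence("One. Two. Learn more here:"))
```

With pandas, the same pattern string should work as a drop-in: df.text.str.replace(PATTERN.pattern, "", regex=True).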
CodePudding user response:
Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:
from nltk.tokenize import sent_tokenize
# split each text into sentences, then drop the ones that end with a colon
(df['text'].map(sent_tokenize)
 .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))
index
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tra...
4 London was building a bigger rival to the Eiff...
Name: text, dtype: object
You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.
In list form the output looks like this:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
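Note that NLTK needs to be installed with pip, and sent_tokenize also needs the Punkt tokenizer data, which is fetched once with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab' instead).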
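If installing NLTK is not an option, the rule from the question can also be written without a tokenizer. The following is only a rough sketch in plain Python; `strip_colon_sentence` is a hypothetical helper name. It passes non-string values such as NaN or None through unchanged, so no fillna('') is needed, and it leaves a text alone when there is no earlier sentence to keep. Unlike sent_tokenize, it treats every ".", "!" and "?" as a sentence end, so abbreviations like "Dr." would be cut at.

```python
def strip_colon_sentence(text):
    """Drop the last sentence when the text ends with ':' (illustrative helper)."""
    # Leave NaN/None and other non-string values untouched.
    if not isinstance(text, str):
        return text
    stripped = text.rstrip()
    if not stripped.endswith(":"):
        return text
    # Index of the last sentence terminator before the trailing colon.
    cut = max(stripped.rfind(ch) for ch in ".!?")
    if cut == -1:
        # No earlier sentence to keep, so return the text unchanged.
        return text
    return stripped[: cut + 1]


print(strip_colon_sentence("Trump met with Putin. Learn more here:"))
```

On a dataframe this could be applied with df['text'].map(strip_colon_sentence).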