Removing a sentence from a text in dataframe column

Time:02-13

I want to format a text column in the dataframe in the following way:

In entries where the last character of the string is a colon (":"), I want to delete the last sentence of the text, i.e. the substring that starts right after the last ".", "?" or "!" and ends at that colon.

Example df:

index    text
1        Trump met with Putin. Learn more here:
2        New movie by Christopher Nolan! Watch here:
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

After formatting, it should look like this:

index    text
1        Trump met with Putin.
2        New movie by Christopher Nolan!
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
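
The rule described above can also be sketched without regex, which makes the intended behaviour explicit. This is a minimal sketch in plain Python: the function name `drop_trailing_colon_sentence` is hypothetical, and it assumes that a text with no ".", "?" or "!" before the final colon should be left unchanged (the question does not say).

```python
def drop_trailing_colon_sentence(text: str) -> str:
    """Remove the last sentence of `text` if the text ends with a colon."""
    stripped = text.rstrip()
    if not stripped.endswith(":"):
        return text  # nothing to remove
    # Position of the last sentence terminator before the final colon.
    cut = max(stripped.rfind(ch) for ch in ".?!")
    if cut == -1:
        return text  # assumption: no earlier sentence, so leave the text as-is
    return stripped[: cut + 1]


examples = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
    "London was building a bigger rival to the Eiffel Tower. Then it all went wrong.",
]
cleaned = [drop_trailing_colon_sentence(t) for t in examples]
```

On a dataframe this could be applied per cell with df['text'].map(drop_trailing_colon_sentence).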

CodePudding user response:

Let's do it with a regex, to have more problems:

df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)

Now df.text.tolist() is:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']

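Lookbehind ftw (a fixed-width one; Python's re module does not support variable-width lookbehinds, but a single character class is all that is needed here).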
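
One edge case worth checking is a text with more than one sentence before the trailing colon. The sketch below uses only the standard `re` module on plain strings (the sample sentence and variable names are made up for illustration) and compares a lazy `.*?` after the lookbehind with a negated character class. Because a regex search returns the leftmost match, the lazy version starts at the first terminator and removes everything after it, while `[^.!?]*` can only span the final sentence.

```python
import re

# Lazy wildcard: the leftmost match starts right after the FIRST terminator.
lazy = re.compile(r"(?<=[.!?]).*?:\s*$")
# Negated class: the match cannot cross another terminator,
# so it covers only the last sentence.
last_only = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")

sample = "Trump met with Putin. They talked for hours. Learn more here:"  # made-up text

lazy_result = lazy.sub("", sample)
last_only_result = last_only.sub("", sample)

print(lazy_result)       # Trump met with Putin.
print(last_only_result)  # Trump met with Putin. They talked for hours.
```

On the four example rows from the question, both patterns give the same result; they only differ when several terminators precede the final colon.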

CodePudding user response:

Use sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:

from nltk.tokenize import sent_tokenize
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))

index
1                                Trump met with Putin.
2                      New movie by Christopher Nolan!
3    Campers: Get ready to stop COVID-19 in its tra...
4    London was building a bigger rival to the Eiff...
Name: text, dtype: object

You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.

In list form the output looks like this:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']

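Note that NLTK needs to be pip-installed, and sent_tokenize also needs NLTK's Punkt tokenizer data, which is downloaded separately with nltk.download (the resource name depends on the NLTK version).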
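
The NaN caveat and the filtering step can be sketched as a single per-cell function. This is only an illustration: `drop_colon_sentences` and `naive_split` are hypothetical names, and `naive_split` is a crude regex stand-in for sent_tokenize so that the block runs without NLTK; passing tokenize=sent_tokenize would use the real tokenizer instead. It treats any non-string cell (such as NaN or None) as an empty string, which mirrors a preceding fillna('').

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Crude stand-in for nltk's sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def drop_colon_sentences(cell, tokenize: Callable[[str], List[str]] = naive_split) -> str:
    """Join the sentences of `cell` that do not end with a colon."""
    if not isinstance(cell, str):
        return ""  # NaN/None: same effect as calling fillna('') first
    return " ".join(s for s in tokenize(cell) if not s.endswith(":"))


cells = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    float("nan"),  # a missing value, as pandas would store it
]
results = [drop_colon_sentences(c) for c in cells]
print(results)
```

On a dataframe this could be applied with df['text'].map(drop_colon_sentences). Like the snippet above, it drops every sentence that ends with a colon, not only the final one.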
