How to remove sentences with a specific character?

Time:11-16

I have a dataframe with article texts. One row, among others, has several sentences with the copyright symbol, "©".

article_texts
© Aaron Davidson/Getty Images Aaron Davidson/Getty Images Beyond Meat cuts 19% of workforce including disgraced COO, according to a release from the company. CEO Ethan Brown says the plant-based company is 'significantly reducing expenses' in an effort to focus on growth. It was one of the best fast food meals I've ever had. 6/25 SLIDES © Mary Meisenzahl/Insider The plant-based protein wasn't meant to be indistinguishable from Taco Bell's signature beef, but "equally cravable," according to Taco Bell's director of global nutrition & sustainability Missy Schaaphok. 22/25 SLIDES © Diana G./Yelp In 2019, Taco Bell North America president Julie Felss Masino publicly said that the chain was relying on its own vegetarian options instead of creating new plant-based meat substitutes. Although it remains unclear exactly how many employees were let go, the company ended 2021 with about 1,100 employees.

I want to remove only the sentences that contain the copyright symbol, and I want to do this for every row in the dataset. This is what I want it to look like:

article_texts
CEO Ethan Brown says the plant-based company is 'significantly reducing expenses' in an effort to focus on growth. It was one of the best fast food meals I've ever had. /Yelp In 2019, Taco Bell North America president Julie Felss Masino publicly said that the chain was relying on its own vegetarian options instead of creating new plant-based meat substitutes. Although it remains unclear exactly how many employees were let go, the company ended 2021 with about 1,100 employees.

This is what I tried:

for i in df['article_texts']:
    try:
        paragraph = i
        tokens = paragraph.split(".")
        for sentence in tokens:
            if "©" in sentence:
                tokens.remove(sentence)
                final = (".").join(tokens)
                df['summaries'].loc[(df['summaries'] == i)] = final
    except:
        print("Yeah, we good.")

Yet, I still get this:

article_texts
CEO Ethan Brown says the plant-based company is 'significantly reducing expenses' in an effort to focus on growth. It was one of the best fast food meals I've ever had. 6/25 SLIDES © Mary Meisenzahl/Insider The plant-based protein wasn't meant to be indistinguishable from Taco Bell's signature beef, but "equally cravable," according to Taco Bell's director of global nutrition & sustainability Missy Schaaphok. 22/25 SLIDES © Diana G./Yelp In 2019, Taco Bell North America president Julie Felss Masino publicly said that the chain was relying on its own vegetarian options instead of creating new plant-based meat substitutes. Although it remains unclear exactly how many employees were let go, the company ended 2021 with about 1,100 employees.

What am I doing wrong?

CodePudding user response:

I will share a simple process:

Replace © with a placeholder string ("mask").

Split the string on ".".

Drop the elements that contain the placeholder using a list comprehension.

text ="""© Aaron Davidson/Getty Images Aaron Davidson/Getty Images Beyond Meat cuts 19% of workforce including disgraced COO, according to a release from the company. CEO Ethan Brown says the plant-based company is 'significantly reducing expenses' in an effort to focus on growth. It was one of the best fast food meals I've ever had. 6/25 SLIDES © Mary Meisenzahl/Insider The plant-based protein wasn't meant to be indistinguishable from Taco Bell's signature beef, but "equally cravable," according to Taco Bell's director of global nutrition & sustainability Missy Schaaphok. 22/25 SLIDES © Diana G./Yelp In 2019, Taco Bell North America president Julie Felss Masino publicly said that the chain was relying on
its own vegetarian options instead of creating new plant-based meat substitutes. Although it remains unclear exactly how
many employees were let go, the company ended 2021 with about 1,100 employees."""
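# Replace the copyright symbol with a placeholder, then split on "."
masked = text.replace("©", "mask")
sentences = masked.split(".")

# Placeholder strings that mark a sentence for removal
mask = ["mask"]

# Keep only the sentences that contain none of the placeholders
filtered = [el for el in sentences if not any(ignore in el for ignore in mask)]
print(filtered)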

Output list:

[" CEO Ethan Brown says the plant-based company is 'significantly reducing expenses' in an effort to focus on growth", " It was one of the best fast food meals I've ever had", '/Yelp In 2019, Taco Bell North America president Julie Felss Masino publicly said that the chain was relying on\nits own vegetarian options instead of creating new plant-based meat substitutes', ' Although it remains unclear exactly how\nmany employees were let go, the company ended 2021 with about 1,100 employees', '']

Join the list:

filtered = '. '.join(filtered)
print(filtered)

Output:

CEO Ethan Brown says the plant-based company is 'significantly reducing 
expenses' in an effort to focus on growth.  It was one of the best fast 
food meals I've ever had. /Yelp In 2019, Taco Bell North America 
president Julie Felss Masino publicly said that the chain was relying on
its own vegetarian options instead of creating new plant-based meat 
substitutes.  Although it remains unclear exactly how
many employees were let go, the company ended 2021 with about 1,100 
employees. 
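
The masking step is not strictly needed, and a placeholder such as "mask" would also drop any sentence that happens to contain that word. Here is a minimal sketch that filters on "©" directly and strips whitespace so the joined text has single spaces. The function name drop_copyright_sentences and the sample string are made up for illustration:

```python
def drop_copyright_sentences(text):
    """Return text without the period-delimited pieces that contain ©."""
    # Split on periods and trim the whitespace around each piece
    parts = [piece.strip() for piece in text.split(".")]
    # Drop empty pieces (e.g. after the final period) and pieces with ©
    kept = [piece for piece in parts if piece and "©" not in piece]
    # Re-join with ". " and restore the final period if anything is left
    return (". ".join(kept) + ".") if kept else ""


# Hypothetical sample text, shaped like the article in the question
sample = (
    "© Jane Doe/Agency A photo caption. First real sentence. "
    "2/5 SLIDES © John Roe/Agency Another caption. Second real sentence."
)
print(drop_copyright_sentences(sample))
```

Splitting on "." is still a rough way to find sentence boundaries: an abbreviation such as "G." ends a piece early, which is why "/Yelp" survives in the output above.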

CodePudding user response:

I'd like to expand a bit on other folks' answers. Any problem requiring the conversion of column values is a great candidate for using .map(). I claim this makes for more readable code.

def remove_sentences_with_copyright(paragraph):
    return '.'.join(sentence for sentence in paragraph.split(".") if "©" not in sentence)

df['summaries'] = df['article_texts'].map(remove_sentences_with_copyright)
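
On the original question of what went wrong, two things in the loop look likely, though this is a reading of the posted code rather than a tested diagnosis. First, the assignment to df['summaries'] sits inside the inner loop and matches rows by comparing against the original text i. Assuming summaries started as a copy of article_texts, the first assignment changes the cell, so later comparisons no longer match and only the first removal is saved. That would be consistent with the output shown in the question. Second, calling tokens.remove() while iterating over tokens makes the loop skip the element right after each removal. The bare except also hides any error that is raised. The second point can be seen with a plain list (the function names and sample list below are hypothetical):

```python
def remove_in_loop(tokens):
    """Mimic the loop in the question: remove items while iterating."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for sentence in tokens:
        if "©" in sentence:
            # Removing shifts later items left, so the next one is skipped
            tokens.remove(sentence)
    return tokens


def filter_with_comprehension(tokens):
    """Build a new list instead of mutating the one being iterated."""
    return [sentence for sentence in tokens if "©" not in sentence]


sample = ["© first", "© second", "plain"]
print(remove_in_loop(sample))             # ['© second', 'plain']
print(filter_with_comprehension(sample))  # ['plain']
```

Building a new list, as the generator expression inside remove_sentences_with_copyright does, avoids the skipping problem entirely.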