I am trying to make a web crawler/scraper to get the news. I want to remove elements that are in a specific class, but the problem is that this class is nested inside another class. The code is below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.moneyreview.gr/life-and-arts/86916/mia-apli-lysi-gia-to-rochalito-to-kolpo-poy-sozei-chiliades-gamoys/'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
title = soup1.find('h1').get_text()
article = requests.get(url)
article_content = article.content
soup_article = BeautifulSoup(article_content, 'html5lib')
body = soup_article.find_all('div', class_='entry-content')
The unwanted elements: Inside the text of the article there is also the text of a tweet. I want to remove this text and all the Twitter tags etc. from the article text so that I end up with clean text. I wrote this code to print everything inside the Twitter tag:
for elements in body:
    quote = soup1.find_all('blockquote', class_="twitter-tweet")
    print(quote)
I get this result:
With the code below I put the paragraphs of the text in a list:
import numpy as np

x = body[0].find_all('p')
list_paragraphs = []
for p in np.arange(0, len(x)):
    paragraph = x[p].text.replace("\n", " ")
    list_paragraphs.append(paragraph)
Where the problem is: I want everything inside the list quote to be removed from the list list_paragraphs, but everything I have tried so far has failed. First I put the text of the tweet into a list:
my_list = []
for i in quote:
    if i:
        my_list.append(i.text.strip())
print(my_list)
Attempt 1
l3 = [x for x in list_paragraphs if x not in my_list]
print(l3)
Attempt 2
for element in my_list:
    if element in list_paragraphs:
        list_paragraphs.remove(element)
Can you suggest something to do?
CodePudding user response:
If I understand correctly, you just want all the text inside the <p> tags, unless it is enclosed in a <blockquote> with the class twitter-tweet. If that is the case, I would say the easiest way to accomplish this is to simply get rid of all the offending <blockquote> tags. You can do that with decompose, for instance. Basically, you find all those tags as you already did, using find_all, and then call .decompose() on each one.
I took the liberty of optimizing your code a bit, as I understood it. Here is my suggestion:
import requests
from bs4 import BeautifulSoup

url = ...

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html5lib')

# drop every tweet blockquote before extracting the paragraphs
for quote in soup.find_all('blockquote', class_="twitter-tweet"):
    quote.decompose()

content_div = soup.find('div', class_='entry-content')

list_paragraphs = []
for p in content_div.find_all('p'):
    text = p.get_text(strip=True).replace("\n", " ")
    if text:
        list_paragraphs.append(text)

print(list_paragraphs)
- In your original code you made the same request to that URL twice for some reason, so I reduced it to a single request.
- When you know that you always want the first <div> with the class entry-content, you can use find instead of find_all.
- Calling .get_text(strip=True) on each paragraph tag strips the text of all surrounding whitespace (there is a short sketch of this right after this list).
- I thought it would make little sense to keep empty strings in your list_paragraphs, so inside the loop we only append the text if it is not empty.
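As a quick illustration of those last two points, here is a small self-contained sketch (the HTML string is made up purely for demonstration) showing what find returns compared to find_all, and what strip=True does:

from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the API behaviour.
html = '<div class="entry-content"><p>  First paragraph \n</p><p></p></div>'
demo = BeautifulSoup(html, 'html.parser')

div = demo.find('div', class_='entry-content')        # a single Tag (or None if absent)
divs = demo.find_all('div', class_='entry-content')   # always a list of Tags

paragraphs = div.find_all('p')
print(repr(paragraphs[0].get_text()))            # '  First paragraph \n'
print(repr(paragraphs[0].get_text(strip=True)))  # 'First paragraph'
print(repr(paragraphs[1].get_text(strip=True)))  # '' -> skipped by the "if text" check above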
Hope this helps.
CodePudding user response:
Another approach would be to use extract to remove the Twitter content, as shown below:
import requests
from bs4 import BeautifulSoup
url = 'your_url'
unwanted_tags = ['blockquote']
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# remove tweets from content
for tag in unwanted_tags:
    for i in soup(tag):  # soup(tag) is shorthand for soup.find_all(tag)
        i.extract()

main_content = [text for text in (p.get_text() for p in soup.find_all('p')) if text not in ['', '\n']]
print(''.join(main_content))
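For this use case extract and decompose are interchangeable; the practical difference is that extract() removes the tag from the tree and returns it (so you could still inspect or log the tweet text afterwards), while decompose() removes the tag and destroys it. A minimal sketch with made-up HTML:

from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the difference.
html = '<div><blockquote class="twitter-tweet"><p>tweet text</p></blockquote><p>article text</p></div>'
demo = BeautifulSoup(html, 'html.parser')

removed = demo.blockquote.extract()   # removed from the tree, but returned
print(removed.get_text())             # 'tweet text' is still available here
print(demo.div.get_text())            # 'article text' -- the tweet is gone

# demo.blockquote.decompose() would remove it as well, but destroy it instead of returning it.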