I am trying to make a web crawler/scraper to get the news. I want to remove elements that are in a specific class, but the problem is that this class is nested inside another class. The code is below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.moneyreview.gr/life-and-arts/86916/mia-apli-lysi-gia-to-rochalito-to-kolpo-poy-sozei-chiliades-gamoys/'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
title = soup1.find('h1').get_text()
article = requests.get(url)
article_content = article.content
soup_article = BeautifulSoup(article_content, 'html5lib')
body = soup_article.find_all('div', class_='entry-content')
The unwanted elements: Inside the text of the article there is also the text of a tweet. I want to remove this text and all the Twitter tags etc. from the article text so that I end up with clean text. I wrote this code to print everything inside the Twitter tag:
for elements in body:
    quote = soup1.find_all('blockquote', class_="twitter-tweet")
    print(quote)
I get this result:
With the code below I put the paragraphs of the text in a list:
import numpy as np

x = body[0].find_all('p')
list_paragraphs = []
for p in np.arange(0, len(x)):
    paragraph = x[p].text.replace("\n", " ")
    list_paragraphs.append(paragraph)
Where the problem is: I want everything inside the list quote to be removed from the list list_paragraphs, but everything I have tried so far has failed. First I put the text of the tweet into a list:
my_list = []
for i in quote:
    if i:
        my_list.append(i.text.strip())
print(my_list)
Attempt 1
l3 = [x for x in list_paragraphs if x not in my_list]
print(l3)
Attempt 2
for element in my_list:
    if element in list_paragraphs:
        list_paragraphs.remove(element)
Can you suggest something to do?
CodePudding user response:
If I understand correctly, you just want all the text inside the <p> tags, unless it is enclosed in a <blockquote> with the class twitter-tweet. If that is the case, I would say the easiest way to accomplish this is to simply get rid of all the offending <blockquote> tags. You can do that with decompose, for instance. Basically, you find all those tags as you already did, using find_all, and then call .decompose() on each one.
I took the liberty of optimizing your code a bit, as I understood it. Here is my suggestion:
import requests
from bs4 import BeautifulSoup

url = ...

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html5lib')

# drop every tweet blockquote before extracting the paragraphs
for quote in soup.find_all('blockquote', class_="twitter-tweet"):
    quote.decompose()

content_div = soup.find('div', class_='entry-content')

list_paragraphs = []
for p in content_div.find_all('p'):
    text = p.get_text(strip=True).replace("\n", " ")
    if text:
        list_paragraphs.append(text)

print(list_paragraphs)
- In your original code you made the same request to that URL twice for some reason, so I reduced it to a single request.
- When you know that you always want the first <div> with the class entry-content, you can use find instead of find_all.
- Calling .get_text(strip=True) on each paragraph tag strips the text of all surrounding whitespace (there is a short sketch of this right after this list).
- I thought it would make little sense to keep empty strings in your list_paragraphs, so inside the loop we only append the text if it is not empty.
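As a quick illustration of those last two points, here is a small self-contained sketch (the HTML string is made up purely for demonstration) showing what find returns compared to find_all, and what strip=True does:

from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the API behaviour.
html = '<div class="entry-content"><p>  First paragraph \n</p><p></p></div>'
demo = BeautifulSoup(html, 'html.parser')

div = demo.find('div', class_='entry-content')        # a single Tag (or None if absent)
divs = demo.find_all('div', class_='entry-content')   # always a list of Tags

paragraphs = div.find_all('p')
print(repr(paragraphs[0].get_text()))            # '  First paragraph \n'
print(repr(paragraphs[0].get_text(strip=True)))  # 'First paragraph'
print(repr(paragraphs[1].get_text(strip=True)))  # '' -> skipped by the "if text" check above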
Hope this helps.
CodePudding user response:
Another approach would be to use extract to remove the Twitter content, as shown below:
import requests
from bs4 import BeautifulSoup
url = 'your_url'
unwanted_tags = ['blockquote']
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# remove tweets from content
for tag in unwanted_tags:
    for i in soup(tag):  # soup(tag) is shorthand for soup.find_all(tag)
        i.extract()

main_content = [text for text in (p.get_text() for p in soup.find_all('p')) if text not in ['', '\n']]
print(''.join(main_content))
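For this use case extract and decompose are interchangeable; the practical difference is that extract() removes the tag from the tree and returns it (so you could still inspect or log the tweet text afterwards), while decompose() removes the tag and destroys it. A minimal sketch with made-up HTML:

from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the difference.
html = '<div><blockquote class="twitter-tweet"><p>tweet text</p></blockquote><p>article text</p></div>'
demo = BeautifulSoup(html, 'html.parser')

removed = demo.blockquote.extract()   # removed from the tree, but returned
print(removed.get_text())             # 'tweet text' is still available here
print(demo.div.get_text())            # 'article text' -- the tweet is gone

# demo.blockquote.decompose() would remove it as well, but destroy it instead of returning it.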