I'm currently trying to put together an article scraper for a website, but I'm running into an issue that I don't know how to solve. This is the code:
import newspaper
from newspaper import Article
import pandas as pd
import datetime
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
import re
urls = open("urls_test.txt").readlines()
final_df = pd.DataFrame()

for url in urls:
    article = newspaper.Article(url="%s" % (url), language='en')
    article.download()
    article.parse()
    article.nlp()

    # scrape html part
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    texts = results.find_all("div", class_="component article-body-text")
    paragraphs = []
    for snippet in texts:
        paragraphs.append(str(snippet))

    CLEANR = re.compile('<.*?>')
    def remove_html(input):
        cleantext = re.sub(CLEANR, '', input)
        return cleantext

    paragraphs_string = ' '.join(paragraphs)
    paragraphs_clean = remove_html(paragraphs_string)
    #
    temp_df = pd.DataFrame(columns=['Title', 'Authors', 'Text', 'Summary', 'published_date', 'URL'])
    temp_df['Authors'] = article.authors
    temp_df['Title'] = article.title
    temp_df['Text'] = paragraphs_clean
    temp_df['Summary'] = article.meta_description
    publish_date = article.publish_date
    publish_date = publish_date.replace(tzinfo=None)
    temp_df['published_date'] = publish_date
    temp_df['URL'] = article.url
    final_df = pd.concat([final_df, temp_df], ignore_index=True)

final_df.to_excel('Telegraph_test.xlsx')
My problem appears in the # scrape html part. Both pieces (the main code without the # scrape html part, and the # scrape html part on its own) run fine in isolation. More specifically, the code as a whole runs until results = soup.find(id="main-content"), returning the results variable as a bs4.element.Tag containing the scraped material, but as the loop continues, the results variable turns into NoneType. This is the error message I get:
AttributeError: 'NoneType' object has no attribute 'find_all'
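For reference, this is the # scrape html part run on its own, which works fine (the URL is just a placeholder for one of the lines in urls_test.txt):

import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # placeholder for one entry from urls_test.txt
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="main-content")
print(type(results))  # bs4.element.Tag when the element exists
texts = results.find_all("div", class_="component article-body-text")
print(len(texts))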
CodePudding user response:
Without knowing any of the URLs or the structure of the HTML, I would say that one of the pages has no element with the attribute id="main-content". So you should always check whether the element you are looking for is actually available:
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="main-content")

if results:
    text = ' '.join([e.get_text(strip=True) for e in results.find_all("div", class_="component article-body-text")])
else:
    text = ''
There is no need for your remove_html(); simply use .get_text() to extract the text from your element(s).
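Put together, your loop could then look roughly like this. This is a sketch, untested against your pages; it also guards article.publish_date (which newspaper leaves as None when it can't find a date), strips the newlines that readlines() keeps, and builds each row from a dict instead of assigning columns to an empty DataFrame:

import newspaper
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = [line.strip() for line in open("urls_test.txt")]  # strip trailing newlines
rows = []

for url in urls:
    article = newspaper.Article(url=url, language='en')
    article.download()
    article.parse()
    article.nlp()

    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")

    # guard against pages without a main-content element
    if results:
        text = ' '.join(e.get_text(strip=True) for e in results.find_all("div", class_="component article-body-text"))
    else:
        text = ''

    # publish_date can also be None for some articles
    publish_date = article.publish_date
    if publish_date:
        publish_date = publish_date.replace(tzinfo=None)

    rows.append({
        'Title': article.title,
        'Authors': ', '.join(article.authors),  # join the list so it writes cleanly to Excel
        'Text': text,
        'Summary': article.meta_description,
        'published_date': publish_date,
        'URL': article.url,
    })

final_df = pd.DataFrame(rows)
final_df.to_excel('Telegraph_test.xlsx')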