How can I get BeautifulSoup running within another for loop?


I'm currently trying to put together an article scraper for a website, but I'm running into an issue that I don't know how to solve. This is the code:

import newspaper
from newspaper import Article
import pandas as pd
import datetime
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
import re

urls = open("urls_test.txt").readlines()

final_df = pd.DataFrame()

for url in urls:
    article = newspaper.Article(url="%s" % (url), language='en')
    article.download()
    article.parse()
    article.nlp()

    # scrape html part
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    texts = results.find_all("div", class_="component article-body-text")
    paragraphs = []
    for snippet in texts:
        paragraphs.append(str(snippet))
    CLEANR = re.compile('<.*?>')
    def remove_html(input):
        cleantext = re.sub(CLEANR, '', input)
        return cleantext
    paragraphs_string = ' '.join(paragraphs)
    paragraphs_clean = remove_html(paragraphs_string)
    #

    temp_df = pd.DataFrame(columns=['Title', 'Authors', 'Text', 'Summary', 'published_date', 'URL'])

    temp_df['Authors'] = article.authors
    temp_df['Title'] = article.title
    temp_df['Text'] = paragraphs_clean
    temp_df['Summary'] = article.meta_description
    publish_date = article.publish_date
    publish_date = publish_date.replace(tzinfo=None)
    temp_df['published_date'] = publish_date
    temp_df['URL'] = article.url

    final_df = pd.concat([final_df, temp_df], ignore_index=True)

final_df.to_excel('Telegraph_test.xlsx')

My problem appears in the # scrape html part. Both pieces run fine on their own (the main code without the # scrape html part, and the # scrape html part by itself). More specifically, the code as a whole runs up to results = soup.find(id="main-content") and returns results as a bs4.element.Tag containing the scraped material, but as the loop continues, results turns into NoneType. This is the error message I get:

AttributeError: 'NoneType' object has no attribute 'find_all'

CodePudding user response:

Without knowing the URLs or the structure of the HTML, I would say that one of the pages has no element with the attribute id="main-content", so you should always check whether the element you are looking for is available:

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="main-content")
if results:
    text = ' '.join([e.get_text(strip=True) for e in results.find_all("div", class_="component article-body-text")])
else:
    text = ''

There is no need for your remove_html(); simply use .get_text() to extract the text from your element(s).
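Putting the two pieces together, the loop body could use a small helper like this. This is a minimal sketch, assuming an empty string is an acceptable fallback for pages without the container (the helper name scrape_article_text is just for illustration):

def scrape_article_text(url):
    # fetch and parse the page
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    # guard against pages without an id="main-content" element
    if results is None:
        return ''
    # .get_text() already strips the tags, so no regex cleanup is needed
    return ' '.join(
        e.get_text(strip=True)
        for e in results.find_all("div", class_="component article-body-text")
    )

With that in place, the whole # scrape html part inside your loop collapses to:

paragraphs_clean = scrape_article_text(url)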
