Formatting scraped content

Time:06-28

I am trying to scrape the title, date and article content from a BBC News article, given its URL.

The article content is spread across multiple divs with the ssrcss-1q0x1qg-Paragraph CSS class, and it goes into the content variable. Without the for loop, content only gets the text of the first div.

import requests
from bs4 import BeautifulSoup

def scraper():
    link = 'https://www.bbc.co.uk/news/uk-52255054'

    # Download the page and parse the HTML
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Find the article wrapper, then every paragraph div inside it
    results = soup.find(class_='ssrcss-pv1rh6-ArticleWrapper')
    body = results.find_all(attrs={'class': 'ssrcss-1q0x1qg-Paragraph'})

    content = []

    # Collect the text of each paragraph div as one list item
    for div in body:
        paras = div.text
        content.append(paras)

    return content

If I just print the scraped content to the console, the formatting is perfect, but when I append it to the content variable, the result contains stray \ characters and extra ", " separators, since it is a list. See the excerpt below.

"\"Coronavirus will not overcome us,\" the Queen has said, in an Easter message to the nation.",
        "While celebrations would be different for many this year, she said: \"We need Easter as much as ever.\"",
        "Referencing the tradition of lighting candles to mark the occasion, she said: \"As dark as death can be - particularly for those suffering with grief - light and life are greater.\"",
        "It comes as the number of coronavirus deaths in UK hospitals reached 9,875.",
        "Speaking from Windsor Castle, the Queen said many religions had festivals celebrating light overcoming darkness, which often featured the lighting of candles.",
        "She said: \"They seem to speak to every culture, and appeal to people of all faiths, and of none.",
        "\"They are lit on birthday cakes and to mark family anniversaries, when we gather happily around a source of light. It unites us.\"",

Is there a way to remove these and turn the list into one block of text?

Thanks

CodePudding user response:

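  • The \ escapes the double quotes inside each string. It only appears when the list is serialised or displayed; it is not part of the text itself.
  • The , between items separates the list elements; it is not part of the items themselves (in this case, strings).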
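A minimal sketch of the two points above. The content list here is a hand-typed sample taken from the excerpt, not a live scrape, and the use of json.dumps is an assumption about how the escaped output was produced. It shows that the backslashes belong to the serialised form of the list, not to the strings:

```python
import json

# Hand-typed sample standing in for the scraped list (assumption: no network call here).
content = [
    '"Coronavirus will not overcome us," the Queen has said, in an Easter message to the nation.',
    'It comes as the number of coronavirus deaths in UK hospitals reached 9,875.',
]

# Serialising the whole list (here as JSON) escapes the inner double quotes
# and puts ", " between the items.
as_json = json.dumps(content)

# A single item is a plain string with no backslash in it.
first = content[0]

print(as_json)
print(first)
```

Printing an individual string shows the text as written; only the serialised list carries the escapes and separators.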

If you want the full text as a single string, with the strings in content separated by a single space, add these lines at the end of your code:

full_text = ' '.join(scraper())
print(full_text)
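If paragraph breaks are worth keeping, the same join works with a different separator. This is a sketch using a hand-typed stand-in for the list that scraper() returns; the strip() call and the blank-line separator are illustrative choices, not something the question asks for:

```python
# Hand-typed stand-in for the list returned by scraper() (assumption).
content = [
    'While celebrations would be different for many this year, she said: "We need Easter as much as ever."  ',
    'It comes as the number of coronavirus deaths in UK hospitals reached 9,875.',
]

# strip() trims stray whitespace around each paragraph before joining.
paragraphs = [p.strip() for p in content]

one_line = ' '.join(paragraphs)        # single block, space-separated
with_breaks = '\n\n'.join(paragraphs)  # one blank line between paragraphs

print(one_line)
print(with_breaks)
```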