I am trying to get the body text of news articles like this one:
In the source code, it can be found after "articleBody".
I've tried BeautifulSoup (bs4), but it doesn't seem able to reach the 'window' object that holds the article body. I can get the text with a regular expression:
text = re.search('"articleBody":"(.*)","keywords"', source_code).group(1)
Where source_code is a string that contains the source code of the URL. However, this method looks pretty inefficient compared to using the bs4 methods when the page allows it. Any advice, please?
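One way to make the regex approach a little more robust, sketched under the assumption that "articleBody" appears in the page source as an ordinary JSON string literal: match only that literal (escapes included) and let json.loads decode it, instead of relying on "keywords" coming next. The helper name extract_article_body and the sample string are made up for illustration.

```python
import json
import re

# Matches "articleBody": followed by exactly one JSON string literal.
# The literal may contain escaped characters such as \" or \n.
ARTICLE_BODY_RE = re.compile(r'"articleBody"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_article_body(source_code):
    """Return the decoded articleBody string, or None if it is not found."""
    match = ARTICLE_BODY_RE.search(source_code)
    if match is None:
        return None
    # json.loads turns the quoted literal into a Python str and resolves escapes.
    return json.loads(match.group(1))


# Hypothetical sample standing in for the real page source.
sample = '{"headline":"x","articleBody":"She said \\"hola\\".","keywords":["a"]}'
print(extract_article_body(sample))
```

Because the pattern stops at the closing quote of the string itself, it should keep working if the site reorders the fields after "articleBody".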
CodePudding user response:
You're right that BeautifulSoup can't read the window object: it only parses the HTML it is given and never runs the page's JavaScript. One option is Selenium, which drives a real browser so the scripts get executed. Here's an example with Python 3:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Create a new instance of Chrome and go to the website we want to scrape
browser = webdriver.Chrome()
browser.get("http://www.elpais.com/")
time.sleep(5) # Let the browser load
# Find the div element containing the article content
# ('articleContent' is a placeholder class name; check the page's actual markup)
div = browser.find_element(By.CLASS_NAME, 'articleContent')
# Print out all the text inside the div
print(div.text)
# Close the browser when done
browser.quit()
Hope this helps!
CodePudding user response:
Try:
import json
import requests
from bs4 import BeautifulSoup
url = "https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for ld_json in soup.select('[type="application/ld+json"]'):
    data = json.loads(ld_json.text)
    if "@type" in data and "NewsArticle" in data["@type"]:
        break
print(data["articleBody"])
Prints:
A una semana de que arranque Sumar ...
Or:
text = soup.select_one('[data-dtm-region="articulo_cuerpo"]').get_text(
    strip=True
)
print(text)
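The JSON-LD loop in this answer assumes each script block decodes to a single object. On other sites a block may hold a list of objects or an "@graph" array, and "@type" may itself be a list. A rough sketch that tolerates those shapes follows; the function names and the sample blocks are hypothetical, and unlike the loop above it compares "@type" exactly rather than by substring.

```python
import json


def iter_ld_nodes(data):
    """Yield every dict in a decoded JSON-LD value (dict, list, or @graph)."""
    if isinstance(data, list):
        for item in data:
            yield from iter_ld_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from iter_ld_nodes(data.get("@graph", []))


def find_article_body(ld_json_texts):
    """Return articleBody from the first NewsArticle node, or None."""
    for raw in ld_json_texts:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        for node in iter_ld_nodes(data):
            node_type = node.get("@type", "")
            types = node_type if isinstance(node_type, list) else [node_type]
            if "NewsArticle" in types and "articleBody" in node:
                return node["articleBody"]
    return None


# Hypothetical script contents standing in for the page's JSON-LD blocks.
blocks = [
    '{"@type": "WebSite", "name": "example"}',
    '[{"@type": "BreadcrumbList"}, {"@type": ["NewsArticle"], "articleBody": "Body text"}]',
]
print(find_article_body(blocks))
```

With a parsed soup, it could be fed the text of every tag returned by soup.select('[type="application/ld+json"]').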