Web scraping 'window' object

Time:07-03

I am trying to get the body text of news articles like this one:

https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html

In the page source, the text appears after the "articleBody" key.

I've tried BeautifulSoup (bs4), but it doesn't seem able to access the 'window' object that holds the article body. I can get the text with a regular expression:

text = re.search(r'"articleBody":"(.*?)","keywords"', source_code).group(1)

where source_code is a string containing the page's HTML source. However, this method looks pretty inefficient compared to using the bs4 methods when the page allows it. Any advice, please?

CodePudding user response:

You're right that BeautifulSoup can't read the 'window' object: it only parses the HTML and never runs the page's JavaScript. One way around that is to drive a real browser with Selenium. Here's an example in Python 3, using the Selenium 4 API:

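import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a new Chrome instance and open the article we want to scrape
browser = webdriver.Chrome()
browser.get("https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html")
time.sleep(5)  # Crude fixed wait to let the page load

# Find the element holding the article body. This assumes the body is
# marked with data-dtm-region="articulo_cuerpo"; adjust to the real markup.
body = browser.find_element(By.CSS_SELECTOR, '[data-dtm-region="articulo_cuerpo"]')

# Print all the text inside that element
print(body.text)

browser.quit()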

Hope this helps!

CodePudding user response:

Try:

import json
import requests
from bs4 import BeautifulSoup


url = "https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


for ld_json in soup.select('[type="application/ld+json"]'):
    data = json.loads(ld_json.text)
    if "@type" in data and "NewsArticle" in data["@type"]:
        break

print(data["articleBody"])

Prints:

A una semana de que arranque Sumar ...
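
The loop above assumes every JSON-LD block decodes to a single object. Some pages wrap their items in a list or under an "@graph" key instead, so here is a rough sketch of a more defensive lookup. The helper names iter_nodes and find_article_body and the sample blocks are made up for illustration; whether this page needs the extra handling is an assumption.

```python
import json


def iter_nodes(data):
    """Yield every dict found in a decoded JSON-LD payload."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        # Some sites nest their items under "@graph"
        yield from iter_nodes(data.get("@graph", []))


def find_article_body(raw_blocks):
    """Return the first NewsArticle articleBody, or None if there is none."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON
        for node in iter_nodes(data):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "NewsArticle" in types and "articleBody" in node:
                return node["articleBody"]
    return None


# Made-up sample blocks, standing in for the contents of the script tags
blocks = [
    '{"@type": "Organization", "name": "Example"}',
    '[{"@type": ["NewsArticle"], "articleBody": "Hello world"}]',
]
print(find_article_body(blocks))
```

With the soup object from the snippet above, the call would look something like find_article_body(tag.text for tag in soup.select('[type="application/ld+json"]')).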

Or:

text = soup.select_one('[data-dtm-region="articulo_cuerpo"]').get_text(
    strip=True
)

print(text)
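
If you would rather keep the regex approach from the question, a somewhat safer variant is sketched below. It assumes the articleBody value is an ordinary JSON string literal in the page source, matches exactly that literal (escaped quotes included) and lets json.loads decode the escapes. The function name extract_article_body and the sample string are invented for illustration.

```python
import json
import re

# Matches the JSON string literal that follows the "articleBody" key,
# allowing escaped characters such as \" inside it.
ARTICLE_BODY_RE = re.compile(r'"articleBody"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_article_body(source_code):
    """Return the decoded articleBody string, or None if it is not found."""
    match = ARTICLE_BODY_RE.search(source_code)
    if match is None:
        return None
    # json.loads turns the quoted literal into a normal Python string
    return json.loads(match.group(1))


# Made-up sample, standing in for the real page source
sample = r'{"headline": "x", "articleBody": "Dijo \"hola\".\nFin", "keywords": ["a"]}'
print(extract_article_body(sample))
```

Unlike the original pattern, this does not depend on "keywords" being the next key, and it does not stop early at an escaped quote.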