CNN news web scraper returns empty [] without information

Time:03-15

So I wrote this code for now:

from urllib import request
from bs4 import BeautifulSoup
import requests
import csv
import re


serch_term = input('What News are you looking for today? ')

url = f'https://edition.cnn.com/search?q={serch_term}'
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")

page_text = doc.find_all('<h3 >')
print(page_text)

But I'm getting an empty [] as the result when I print(page_text). Can someone help me?

CodePudding user response:

There are several issues:

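  • The content is rendered dynamically by JavaScript, so you won't get it with requests, which only fetches the initial HTML.

  • We do not know your search term; maybe there are no results for it.

  • BeautifulSoup does not accept a markup fragment like <h3 > as a selector. find_all() expects a tag name such as 'h3', and select() expects a CSS selector.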
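
The selector issue can be seen in isolation on static markup. The following is a minimal sketch, assuming a hand-written HTML snippet (the snippet, its URLs, and its headline texts are hypothetical stand-ins for a rendered results page); it only compares how find_all() and select() treat the different arguments:

```python
from bs4 import BeautifulSoup

# Hypothetical static markup standing in for a rendered results page.
html = """
<div>
  <h3 class="cnn-search__result-headline"><a href="//www.cnn.com/a/index.html">First headline</a></h3>
  <h3 class="cnn-search__result-headline"><a href="//www.cnn.com/b/index.html">Second headline</a></h3>
  <h3 class="other">Unrelated heading</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A markup fragment is treated as a literal tag name, so nothing matches.
raw_tag = soup.find_all("<h3 >")

# A plain tag name matches every <h3>, whatever its class.
by_name = soup.find_all("h3")

# Tag name plus class filter, or the equivalent CSS selector.
by_class = soup.find_all("h3", class_="cnn-search__result-headline")
by_css = soup.select("h3.cnn-search__result-headline")

print(len(raw_tag), len(by_name), len(by_class), len(by_css))  # 0 3 2 2
```

Even with a correct selector, the original requests-based script would likely still return an empty list, because the headlines are not in the HTML that requests receives.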

How to fix it? Use Selenium, which drives a real browser, also renders the JavaScript, and gives you the page_source as expected.

Example

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(executable_path='YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/search?q=python')

soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select('h3.cnn-search__result-headline')

Output

[<h3 class="cnn-search__result-headline">
 <a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
 </h3>,
 <h3 class="cnn-search__result-headline">
 <a href="//www.cnn.com/2021/11/19/cnn-underscored/athleta-gift-shop-holiday/index.html">With gift options under $50 plus splurge-worthy seasonal staples, Athleta's Gift Shop is a holiday shopping haven</a></h3>,...]

To get the title, read the .text attribute of each element while iterating over your ResultSet; to grab the value of href, use ['href'] on its contained <a>.
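
The iteration described above might look like the following. This is a minimal sketch, assuming the markup has the shape shown in the output (the class attribute on the h3 is inferred from the selector); the `results` list, the dictionary keys, and the use of urljoin to resolve the protocol-relative href against https://edition.cnn.com/ are illustrative choices, not part of the original answer:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Static copy of one result, shaped like the output above.
html = """
<h3 class="cnn-search__result-headline">
 <a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for h3 in soup.select("h3.cnn-search__result-headline"):
    link = h3.find("a")
    if link is None:
        # Skip headlines without a link instead of raising on ['href'].
        continue
    results.append({
        # get_text(strip=True) drops the surrounding whitespace and newlines.
        "title": h3.get_text(strip=True),
        # The href starts with '//', so resolve it against an https base URL.
        "url": urljoin("https://edition.cnn.com/", link["href"]),
    })

for item in results:
    print(item["title"], item["url"])
```

With Selenium, the same loop would run on a soup built from driver.page_source instead of the static string.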
