CNN news web scraper returns an empty list [] with no results

So far I have written this code:

from urllib import request
from bs4 import BeautifulSoup
import requests
import csv
import re


serch_term = input('What News are you looking for today? ')

url = f'https://edition.cnn.com/search?q={serch_term}'
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")

page_text = doc.find_all('<h3 >')
print(page_text)

But I get an empty list [] when I print(page_text). Can someone help me?

CodePudding user response:

There are several issues:

The content is rendered dynamically by JavaScript, so requests alone will not return it.
We do not know your search term; there may simply be no results for it.
find_all() expects a tag name such as 'h3'; BeautifulSoup does not accept markup like '<h3 >' as a selection.
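The third point can be checked without any network access. The following is a minimal sketch on a small static snippet; the HTML and the class name "headline" are made up for illustration and are not CNN's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for an already rendered results page.
html = """
<div>
  <h3 class="headline"><a href="/a">First story</a></h3>
  <h3 class="headline"><a href="/b">Second story</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The argument is treated as a tag *name*; no tag is named "<h3 >".
wrong = soup.find_all("<h3 >")

# A plain tag name matches, optionally narrowed by class or a CSS selector.
by_name = soup.find_all("h3")
by_class = soup.find_all("h3", class_="headline")
by_css = soup.select("h3.headline")

print(wrong)         # []
print(len(by_name))  # 2
```

Either find_all('h3', class_=...) or select('h3.some-class') works; select() is convenient when the selector is copied from the browser's developer tools.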

How to fix it? Use Selenium, which drives a real browser, executes the JavaScript, and provides the rendered HTML as page_source.

Example

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point the Service at your local chromedriver binary
service = Service(executable_path='YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/search?q=python')

# Parse the rendered HTML and select the result headlines
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select('h3.cnn-search__result-headline')

Output

[<h3 >
 <a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
 </h3>,
 <h3 >
 <a href="//www.cnn.com/2021/11/19/cnn-underscored/athleta-gift-shop-holiday/index.html">With gift options under $50 plus splurge-worthy seasonal staples, Athleta's Gift Shop is a holiday shopping haven</a></h3>,...]
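Because the results are injected by JavaScript, page_source may be read before they have appeared. One possible way to guard against that is an explicit wait. The helper below is a sketch, not part of the original answer: the function name fetch_rendered_html and the 10-second default timeout are assumptions, and it expects a driver created as in the example above.

```python
def fetch_rendered_html(driver, url, css_selector, timeout=10):
    """Load url and return page_source once css_selector matches an element."""
    # Imported inside the function so the sketch can be loaded
    # even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Raises selenium.common.exceptions.TimeoutException if nothing
    # matches the selector within `timeout` seconds.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    return driver.page_source
```

It could be called as fetch_rendered_html(driver, 'https://edition.cnn.com/search?q=python', 'h3.cnn-search__result-headline'), and the returned string passed to BeautifulSoup as before.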

To get the title, read the .text attribute of each element while iterating over the ResultSet; to get the link, use ['href'] on the <a> it contains.
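As a rough sketch of that last step, the loop below runs on a static copy of the first result from the output. The class attribute on the h3 is assumed from the selector used in the example, and base_url and the results list are names introduced here for illustration. The href values are protocol-relative (they start with //), so urljoin is used to turn them into full URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Static copy of one result; the class is assumed from the selector used above.
html = """
<h3 class="cnn-search__result-headline">
 <a href="//www.cnn.com/travel/article/airasia-malaysia-snake-plane-rerouted-intl-hnk/index.html">AirAsia flight in Malaysia rerouted after snake found on board plane</a>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")
base_url = "https://edition.cnn.com/search"  # page the results came from

results = []
for h3 in soup.select("h3.cnn-search__result-headline"):
    link = h3.find("a")
    # Skip headlines that have no usable link.
    if link is None or not link.has_attr("href"):
        continue
    title = link.get_text(strip=True)
    url = urljoin(base_url, link["href"])  # adds the missing "https:" scheme
    results.append((title, url))

for title, url in results:
    print(title, url)
```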