I'd like to scrape news headline, link of news and picture of that news.
I try to use web scraping following as below. but It's only headline code and It is not work.
import requests
import pandas as pd
from bs4 import BeautifulSoup
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
print(headlines[i].text)
Please recommend it to me.
CodePudding user response:
This is because the site blocks bot. If you print the res.content which shows 403.
Add headers={'User-Agent':'Mozilla/5.0'} to the request.
Try the code below,
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
print(headlines[i].text)
CodePudding user response:
First things first: never post code as an image.
<h2>
in your HTML has no text
. What it does have, is an <a>
element, so:
for hl in headlines:
link = hl.findChild()
text = link.text
url = link.attrs['href']