How to scrape news headlines, links and images?

I'd like to scrape the headline, the link and the picture of each news article.

I tried the web scraping code below, but it only targets the headlines, and it does not work.

import requests
import pandas as pd
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')

headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)

Any recommendations would be appreciated.

CodePudding user response：

This happens because the site blocks bots. If you print res.content, you will see a 403 error page instead of the article list.

Add headers={'User-Agent':'Mozilla/5.0'} to the request.

Try the code below:

import requests
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
# Send a browser-like User-Agent so the site does not answer with a 403
res = requests.get(nbc_business, verify=False, headers={'User-Agent':'Mozilla/5.0'})

soup = BeautifulSoup(res.content, 'html.parser')

headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for headline in headlines:
    print(headline.text)
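
To confirm that a 403 is really what is happening, it can help to check the status code before parsing anything. The following is only a minimal sketch: the helper name fetch_html, its get parameter and the 10-second timeout are assumptions made for illustration, not part of the original code. verify=False is left out here because it disables certificate checking; it can be passed again if certificate errors occur.

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # same header as suggested above


def fetch_html(url, get=requests.get):
    """Return the page body, or raise if the server refused the request.

    `fetch_html` and the `get` parameter are hypothetical names; `get`
    exists only so the function can be tried without network access.
    """
    res = get(url, headers=HEADERS, timeout=10)
    if res.status_code != 200:
        # A 403 here usually means the request was rejected as a bot
        raise RuntimeError(f"Request failed with HTTP {res.status_code}")
    return res.content
```

Calling fetch_html("https://news.mongabay.com/list/environment") and passing the result to BeautifulSoup would then fail loudly instead of silently returning zero headlines.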

CodePudding user response：

First things first: never post code as an image.

The <h2> in your HTML has no text of its own. What it does have is an <a> element, so:

for hl in headlines:
    link = hl.find('a')        # the <a> element inside the <h2>
    text = link.text           # headline text
    url = link.attrs['href']   # link to the article
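
Since the question also asks for the picture, here is a rough sketch that collects headline, link and image together. The sample markup, the "article" container class and the function name extract_articles are assumptions for illustration only; the real page may place the <img> somewhere else, so inspect the HTML and adjust the lookup accordingly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page may nest these elements differently.
SAMPLE_HTML = """
<div class="article">
  <img src="https://example.com/a.jpg">
  <h2 class="post-title-news"><a href="https://example.com/a">First headline</a></h2>
</div>
<div class="article">
  <h2 class="post-title-news"><a href="https://example.com/b">Second headline</a></h2>
</div>
"""


def extract_articles(html):
    """Return a list of dicts with headline, link and image URL (or None)."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for h2 in soup.find_all("h2", class_="post-title-news"):
        link = h2.find("a")
        if link is None:
            continue  # skip headings that contain no link
        # Assumption: the image sits in the same parent container as the <h2>.
        img = h2.parent.find("img")
        articles.append({
            "headline": link.get_text(strip=True),
            "link": link.get("href"),
            "image": img.get("src") if img else None,
        })
    return articles


rows = extract_articles(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because each result is a plain dict, the list can be passed to pd.DataFrame(rows) if a table is wanted, which fits the pandas import in the question.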