How to scrape news articles from CNBC with the keyword "Green hydrogen"?

I am trying to scrape the news articles listed at this URL. All the article titles are in span.Card-title elements, but my code gives blank output. Is there any way to resolve this?

from bs4 import BeautifulSoup as soup
import requests

cnbc_url = "https://www.cnbc.com/search/?query=green hydrogen&qsearchterm=green hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content,'html.parser')

day = bsobj.find(id="root")
print(day.find_all('span',class_='Card-title'))

for link in bsobj.find_all('span',class_='Card-title'):
    print('Headlines : {}'.format(link.text))

CodePudding user response：

As mentioned in another answer, the article data is loaded through a separate request, which you can find via the Network tab in devtools. [In Chrome, you can open devtools with Ctrl+Shift+I, then go to the Network tab to see the requests made, click on the name starting with 'json.aspx?...' to see its details, and copy the Request URL from the Headers section.]

Once you have the Request URL, you can copy it and make the request in your code to get the data:

# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']:
    print(r['cn:title'])
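
If some entries turn out to lack a title, the loop above would raise a KeyError. A minimal sketch of a more defensive version follows; it assumes only what the loop already relies on (a 'results' list whose entries may carry a 'cn:title' field). The helper name extract_titles and the sample payload are made up for illustration and are not real CNBC data.

```python
def extract_titles(payload):
    """Return the stripped 'cn:title' of each entry in payload['results'].

    Entries without a title are skipped instead of raising KeyError.
    """
    titles = []
    for item in payload.get("results", []):
        title = item.get("cn:title")
        if title:
            titles.append(title.strip())
    return titles


# Hypothetical payload with the same shape the loop above relies on.
sample_payload = {
    "results": [
        {"cn:title": "Example headline one"},
        {"cn:title": "  Example headline two  "},
        {"url": "https://example.com/no-title"},  # no title: skipped
    ]
}

for title in extract_titles(sample_payload):
    print(f"Headline: {title}")
```

With a live response, the call would be extract_titles(dataReq.json()).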

If you don't feel like hunting for that one request among 250 others, you can also try to assemble a shorter form of the URL with something like:

import urllib.parse

search_for = 'green hydrogen'

# find the link to the js file that contains the api key
# (bsobj is the BeautifulSoup object of the search page, as in the question)
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]

jRes = requests.get(jUrl) # request the js file that contains the api key

# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()

# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for,
    'batchsize': 10 # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
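
The split chain above returns whatever text it ends up with, even when 'QUERYLY_KEY:' is not in the file at all. A hedged alternative is a regular expression that returns None when nothing matches. This sketch assumes the key appears as QUERYLY_KEY followed by a quoted value, which is what the split logic implies; the helper name extract_queryly_key and the sample JavaScript string are invented for illustration, and the real file may be laid out differently.

```python
import re


def extract_queryly_key(js_text):
    """Return the quoted value that follows QUERYLY_KEY:, or None if absent."""
    match = re.search(r"""QUERYLY_KEY\s*:\s*["']([^"']+)["']""", js_text)
    return match.group(1) if match else None


# Made-up snippet standing in for the downloaded JavaScript file.
sample_js = 'var config={SITE:"cnbc",QUERYLY_KEY: "0123456789abcdef",DEBUG:!1};'

key = extract_queryly_key(sample_js)
if key is None:
    print("QUERYLY_KEY not found; the page layout may have changed")
else:
    print(key)
```

Against the real file, this would be called as extract_queryly_key(jRes.text), with a None result signalling that the approach needs revisiting.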

Even though the assembled dataReqUrl is not identical to the copied one, it seems to give the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set chromeDriver_path to where you saved 'chromedriver.exe'
chromeDriver_path = 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green hydrogen&qsearchterm=green hydrogen"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(chromeDriver_path))
driver.get(cnbc_url)

# wait up to 5 seconds for the dynamically loaded titles to become visible
ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)

cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles]
for c in cardTitles_text:
    print(c)

driver.quit() # close the browser when done

In my opinion, this approach is more reliable as well as simpler.

CodePudding user response：

The problem is that the content is not present on the page when it first loads; it is only fetched from the server afterwards, using a URL like this

https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats|&facetedvalue=!Press Release|&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

and then added to the page.

Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
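
Once such a request URL has been copied from devtools, it can be taken apart to see which parameters it carries and to swap in a different search term. The sketch below uses only the standard library and a shortened, percent-encoded version of the URL above; the helper name replace_search_term is made up here, and whether the endpoint accepts the shortened parameter set is an assumption.

```python
from urllib.parse import parse_qs, parse_qsl, quote, urlencode, urlsplit, urlunsplit

# Shortened form of the request URL shown above (space encoded as %20).
copied_url = (
    "https://api.queryly.com/cnbc/json.aspx"
    "?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10"
)


def replace_search_term(request_url, new_term):
    """Return request_url with the value of its 'query' parameter replaced."""
    parts = urlsplit(request_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    updated = [(k, new_term if k == "query" else v) for k, v in params]
    # quote_via=quote encodes spaces as %20 rather than '+'
    return urlunsplit(parts._replace(query=urlencode(updated, quote_via=quote)))


# Inspect the parameters of the copied URL.
for name, values in parse_qs(urlsplit(copied_url).query).items():
    print(name, "=", values[0])

print(replace_search_term(copied_url, "blue hydrogen"))
```

The resulting URL can then be requested in the same way as the copied one.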