Scraping Blog Post Titles with Selenium

I am trying to scrape the blog post titles using Selenium with Python of the following URL: https://blog.coinbase.com/tagged/coinbase-pro. When I use Selenium to get the page source, it does not contain the blog post titles, but the Chrome source code does when I right click and select "view page source". I'm using the following code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
pageSource = driver.page_source
print(pageSource)

Any help would be appreciated. Thanks.

CodePudding user response：

wait=WebDriverWait(driver,30)                                 
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
elements=wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".graf.graf--h3.graf-after--figure.graf--trailing.graf--title")))
for elem in elements:
   print(elem.text)

If you wanted the 8 titles you can grab them by their css selector using waits.

Import:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

Outputs:

Inverse Finance (INV), Liquity (LQTY), Polyswarm (NCT) and Propy (PRO) are launching on Coinbase Pro
Goldfinch Protocol (GFI) is launching on Coinbase Pro
Decentralized Social (DESO) is launching on Coinbase Pro
API3 (API3), Bluezelle (BLZ), Gods Unchained (GODS), Immutable X (IMX), Measurable Data Token (MDT) and Ribbon…
Circuits of Value (COVAL), IDEX (IDEX), Moss Carbon Credit (MCO2), Polkastarter (POLS), ShapeShift FOX Token (FOX)…
Voyager Token (VGX) is launching on Coinbase Pro
Alchemix (ALCX), Ethereum Name Service (ENS), Gala (GALA), mStable USD (MUSD) and Power Ledger (POWR) are launching…
Crypto.com Protocol (CRO) is launching on Coinbase Pro

CodePudding user response：

You can fetch all the titles from that webpage in several ways. The efficient and fastest way would be to opt for requests.

This is how you can grab the titles using requests:

import re
import json
import time
import requests

link = 'https://medium.com/the-coinbase-blog/load-more'
params = {
    'sortBy': 'tagged',
    'tagSlug': 'coinbase-pro',
    'limit': 25,
    'to': int(time.time() * 1000),
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    s.headers['accept'] = 'application/json'
    s.headers['referer'] = 'https://blog.coinbase.com/tagged/coinbase-pro'
    
    while True:
        res = s.get(link,params=params)
        container = json.loads(re.findall("[^{] (.*)",res.text)[0])
        for k,v in container['payload']['references']['Post'].items():
            title = v['title']
            print(title)

        try:
            next_page = container['payload']['paging']['next']['to']
        except KeyError:
            break

        params['to'] = next_page

However, if it is selenium you want to stick with, try the following:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def scroll_down_to_the_bottom():
    check_height = driver.execute_script("return document.body.scrollHeight;")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            WebDriverWait(driver,10).until(lambda driver: driver.execute_script("return document.body.scrollHeight;")  > check_height)
            check_height = driver.execute_script("return document.body.scrollHeight;") 
        except TimeoutException:
             break

with webdriver.Chrome() as driver:                          
    driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
    scroll_down_to_the_bottom()
    for item in WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".section-content h3.graf--title"))):
       print(item.text)