Python download updated source page in selenium

Time:03-16

I am trying to download the HTML content of https://coinmarketcap.com/historical/20210328/ with this code:

import time

from selenium import webdriver

url = "https://coinmarketcap.com/historical/20210328/"
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
driver.find_element_by_css_selector(".cmc-cookie-policy-banner__close").click()
time.sleep(2)
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
data = driver.page_source
print(data)

I click the "load more" button at the bottom of the page because I need at least 1000 elements, not just the first 200. But when I print the page source, it contains only the first 200, as if it were frozen at the HTML of the initial page load and did not take my actions on the page into account. How can I fix this?

CodePudding user response:

Not really an answer to your question, but some analysis of the web page you're trying to scrape reveals that it pulls its data directly from this endpoint:

https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical?convert=USD,USD,BTC&date=2021-03-28&limit=200&start=401

This returns JSON, which is easier to load into Python than scraped HTML.

# import requests module
import requests
 
# Making a get request
response = requests.get('https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical?convert=USD,USD,BTC&date=2021-03-28&limit=200&start=401')
 
# print response
print(response)
 
# print json content
print(response.json())
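
The URL above asks for 200 entries starting at position 401, so reaching 1000 entries would take several requests. Here is a minimal sketch of that pagination. It assumes `start` is 1-based and that pages are contiguous (1, 201, 401, ...), which the example URL suggests but does not confirm. The names `page_params` and `fetch_all` are made up for this illustration, and the JSON payloads are returned untouched because their structure is not shown here.

```python
BASE_URL = "https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical"


def page_params(date, total=1000, limit=200):
    """Yield one query-parameter dict per page.

    Assumption: `start` is 1-based, so the pages begin at 1, 201, 401, ...
    """
    for start in range(1, total + 1, limit):
        yield {
            "convert": "USD,USD,BTC",
            "date": date,
            "limit": limit,
            "start": start,
        }


def fetch_all(date, total=1000, limit=200):
    """Request every page and return the decoded JSON payloads as a list."""
    import requests  # third-party; imported here so page_params works without it

    payloads = []
    for params in page_params(date, total, limit):
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()  # fail loudly on HTTP errors
        payloads.append(response.json())
    return payloads
```

Calling `fetch_all("2021-03-28")` would send five requests. Inspect one payload first to see where the rows live before merging them.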

CodePudding user response:

With a delay after each click on 'load more', and another delay after the last click before reading page_source, I see that the clicks do change the content of data = driver.page_source.
With the code below, the initial page_source length is 396447 while the final page_source length is 946180.

import time

from selenium import webdriver

url = "https://coinmarketcap.com/historical/20210328/"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
time.sleep(2)
driver.find_element_by_css_selector(".cmc-cookie-policy-banner__close").click()
time.sleep(2)
data = driver.page_source
print(len(data))
# print(data)
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
time.sleep(2)
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
time.sleep(2)
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
time.sleep(2)
driver.find_element_by_css_selector(".cmc-table-listing__loadmore > button:nth-child(1)").click()
time.sleep(2)
data = driver.page_source
print(len(data))
# print(data)
driver.quit()

This code should be improved to remove the hardcoded sleeps, but basically it works.
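
One possible way to drop the sleeps is Selenium's explicit waits: wait until the button is clickable, click it, then wait until the table has more rows than before. The sketch below makes some assumptions. `ROW_SELECTOR` is a guess at how table rows can be matched and must be checked against the real page, and `row_count_exceeds` and `load_rows` are names invented for this example. It also uses `find_elements(By.CSS_SELECTOR, ...)` because the `find_element_by_*` helpers were removed in recent Selenium 4 releases.

```python
LOAD_MORE = ".cmc-table-listing__loadmore > button:nth-child(1)"  # from the code above
ROW_SELECTOR = "table tbody tr"  # assumption: adjust to the real row markup


def row_count_exceeds(previous, row_selector=ROW_SELECTOR):
    """Build a wait condition that is true once more than `previous` rows exist."""
    def condition(driver):
        # "css selector" is the string value of By.CSS_SELECTOR
        return len(driver.find_elements("css selector", row_selector)) > previous
    return condition


def load_rows(driver, clicks=4, timeout=10):
    """Click 'load more' `clicks` times, waiting for new rows after each click."""
    # Imported lazily so the pure helper above is usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    for _ in range(clicks):
        before = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, LOAD_MORE))
        )
        button.click()
        # Raises TimeoutException if no new rows appear within `timeout` seconds.
        wait.until(row_count_exceeds(before))
    return driver.page_source
```

After `driver.get(url)` and closing the cookie banner, `data = load_rows(driver)` would replace the four click-and-sleep pairs. Each wait returns as soon as its condition holds, so the script does not pause longer than needed.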
