I'm struggling to scrape historical data from https://coincodex.com/crypto/bitcoin/historical-data/ with Selenium. I fail at the following steps:
- Getting the data from the subsequent pages (not only September, which is page 1)
- Replacing '$ ' with '$' in each value
- Converting values with a 'B' suffix (for billion) into full numbers (1B into 1000000000)
The predefined task is: web-scrape all data from the beginning of the year until the end of September with Selenium and BeautifulSoup and transform it into a pandas DataFrame. My code so far is:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

# The page source fetched from the driver is parsed with BeautifulSoup
HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')
Table = HTMLPage.find('table', class_='styled-table full-size-table')
Rows = Table.find_all('tr', class_='ng-star-inserted')
len(Rows)
# Empty list to store the data
extracted_data = []

# Loop through each row of the table
for i in range(len(Rows)):
    try:
        # Empty dictionary to store the data present in this row
        RowDict = {}
        # Extract all the columns of the row
        Values = Rows[i].find_all('td')
        # Values (Open, High, Close etc.) are extracted and stored in the dictionary
        if len(Values) == 7:
            RowDict["Date"] = Values[0].text.replace(',', '')
            RowDict["Open"] = Values[1].text.replace(',', '')
            RowDict["High"] = Values[2].text.replace(',', '')
            RowDict["Low"] = Values[3].text.replace(',', '')
            RowDict["Close"] = Values[4].text.replace(',', '')
            RowDict["Volume"] = Values[5].text.replace(',', '')
            RowDict["Market Cap"] = Values[6].text.replace(',', '')
            extracted_data.append(RowDict)
    except Exception:
        # Report which row failed to parse
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)
Sorry, I'm new to Python and web scraping, and I hope someone can help me. It would be very much appreciated.
CodePudding user response:
Coincodex provides a query UI in which you can adjust the time range. After setting the start and end dates to the 1st of January and the 30th of September and clicking the "Select" button, the site sends a GET request to the backend, using the endpoint https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791. If you send a request to this URL, you get back all the data you need for this interval:
import requests
import pandas as pd

# Fetch the JSON payload for 2021-01-01 .. 2021-09-30 and load it into a DataFrame
data = requests.get('https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791').json()
df = pd.DataFrame(data['data'])
Output:
time_start time_end price_open_usd ... price_avg_ETH volume_ETH market_cap_ETH
0 2021-01-01 00:00:00 2021-01-02 00:00:00 28938.896888 ... 39.496780 8.728544e+07 7.341417e+08
1 2021-01-02 00:00:00 2021-01-03 00:00:00 29329.695772 ... 40.934106 9.351177e+07 7.608959e+08
2 2021-01-03 00:00:00 2021-01-04 00:00:00 32148.048500 ... 38.970510 1.448755e+08 7.244327e+08
3 2021-01-04 00:00:00 2021-01-05 00:00:00 32949.399464 ... 31.433580 1.292715e+08 5.843597e+08
4 2021-01-05 00:00:00 2021-01-06 00:00:00 32023.293433 ... 30.478852 1.186652e+08 5.666423e+08
.. ... ... ... ... ... ... ...
268 2021-09-26 00:00:00 2021-09-27 00:00:00 42670.363351 ... 14.438247 1.573066e+07 2.718238e+08
269 2021-09-27 00:00:00 2021-09-28 00:00:00 43204.962300 ... 14.157527 1.660821e+07 2.665518e+08
270 2021-09-28 00:00:00 2021-09-29 00:00:00 42111.843283 ... 14.439326 1.782125e+07 2.718712e+08
271 2021-09-29 00:00:00 2021-09-30 00:00:00 41004.598500 ... 14.510256 1.748895e+07 2.732201e+08
272 2021-09-30 00:00:00 2021-10-01 00:00:00 41536.594100 ... 14.454206 1.810257e+07 2.721773e+08
[273 rows x 23 columns]
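If you only need the columns from the original task, you can trim and rename the API response. A minimal sketch, run here on illustrative sample records shaped like the output above (the `price_high_usd`, `price_low_usd`, and `price_close_usd` column names are assumptions extrapolated from the `price_open_usd` pattern; check them against the real response):

```python
import pandas as pd

# Illustrative records mimicking the API response shown above
# (real responses contain 23 columns; only a few are used here, and
# all column names besides time_start / price_open_usd are assumed).
sample = {'data': [
    {'time_start': '2021-01-01 00:00:00', 'price_open_usd': 28938.896888,
     'price_high_usd': 29680.0, 'price_low_usd': 28725.0,
     'price_close_usd': 29329.695772},
    {'time_start': '2021-01-02 00:00:00', 'price_open_usd': 29329.695772,
     'price_high_usd': 33300.0, 'price_low_usd': 29091.0,
     'price_close_usd': 32148.048500},
]}

df = pd.DataFrame(sample['data'])
# Keep only the OHLC columns and rename them to the headings used on the site
df = df.rename(columns={'time_start': 'Date', 'price_open_usd': 'Open',
                        'price_high_usd': 'High', 'price_low_usd': 'Low',
                        'price_close_usd': 'Close'})
df['Date'] = pd.to_datetime(df['Date'])
print(df)
```

With the real response, the same `rename` call applies to `pd.DataFrame(data['data'])` directly.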
CodePudding user response:
To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex page, you need to induce WebDriverWait for visibility_of_all_elements_located(). Using list comprehensions you can then collect each column into a list, build a DataFrame from those lists, and print it, using the following locator strategies:
Code Block:
driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
my_list = [[headers], [dates], [opens], [highs], [lows], [closes], [volumes], [marketcaps]]
df = pd.DataFrame(my_list)
print(df)
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
Console Output:
0 [Date, Open, High, Low, Close, Volume, Market ...
1 [Oct 27, 2021, Oct 28, 2021, Oct 29, 2021, Oct...
2 [$ 60,332, $ 58,438, $ 60,600, $ 62,225, $ 61,...
3 [$ 61,445, $ 61,940, $ 62,945, $ 62,225, $ 62,...
4 [$ 58,300, $ 58,240, $ 60,341, $ 60,860, $ 60,...
5 [$ 58,681, $ 60,439, $ 62,220, $ 61,661, $ 61,...
6 [$ 84.44B, $ 99.67B, $ 86.79B, $ 82.73B, $ 74....
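The scraped strings still carry the '$ ' prefix and the abbreviated magnitudes that the original question asks to transform. A small helper can normalize them; this is a sketch, and the suffix set is an assumption based on the values visible above:

```python
def parse_dollar(value: str) -> float:
    """Turn strings like '$ 84.44B' or '$ 60,332' into plain numbers."""
    # Multipliers for the abbreviated magnitudes (assumed set)
    suffixes = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}
    # Drop the dollar sign, thousands separators, and surrounding whitespace
    cleaned = value.replace('$', '').replace(',', '').strip()
    if cleaned and cleaned[-1] in suffixes:
        return float(cleaned[:-1]) * suffixes[cleaned[-1]]
    return float(cleaned)

print(parse_dollar('$ 84.44B'))   # 84440000000.0
print(parse_dollar('$ 60,332'))   # 60332.0
```

Applied with something like `df[col].map(parse_dollar)` to each value column, this covers both the '$ ' replacement and the 'B'-to-billion conversion in one pass.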