Because the CoinMarketCap API plans limit access to historical data, I am trying to scrape the website instead.
However, I am stuck at the first hurdle despite reading the Beautiful Soup (crummy.com) documentation on attributes.
import json
import requests
from bs4 import BeautifulSoup
r = requests.get('https://coinmarketcap.com/historical/20210905/')
soup = BeautifulSoup(r.text, 'lxml')
print(soup)
The output contains the data I am trying to scrape: Market Cap, Price and Circulating Supply for BTC on 5th September 2021.
The data appears in the output soon after <script id="__NEXT_DATA__" type="application/json">, and for this reason I thought that using __NEXT_DATA__ as the attribute id would allow me to access the data. Unfortunately not.
An example of the data structure where the data is contained looks as follows:
"listingHistorical":{"data":[{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","num_market_pairs":8848,"date_added":"2013-04-28T00:00:00.000Z","tags":["mineable","pow","sha-256","store-of-value","state-channels","coinbase-ventures-portfolio","three-arrows-capital-portfolio","polychain-capital-portfolio","binance-labs-portfolio","arrington-xrp-capital","blockchain-capital-portfolio","boostvc-portfolio","cms-holdings-portfolio","dcg-portfolio","dragonfly-capital-portfolio","electric-capital-portfolio","fabric-ventures-portfolio","framework-ventures","galaxy-digital-portfolio","huobi-capital","alameda-research-portfolio","a16z-portfolio","1confirmation-portfolio","winklevoss-capital","usv-portfolio","placeholder-ventures-portfolio","pantera-capital-portfolio","multicoin-capital-portfolio","paradigm-xzy-screener"],"max_supply":21000000,"circulating_supply":18807550,"total_supply":18807550,"platform":null,"cmc_rank":1,"last_updated":"2021-09-05T23:00:00.000Z","quote":{"BTC":{"price":1,"volume_24h":585906.8067215424,"percent_change_1h":0,"percent_change_24h":0,"percent_change_7d":0,"market_cap":18807550,"fully_diluted_market_cap":null,"last_updated":"2021-09-05T23:59:03.000Z"},"USD":{"price":51753.41192620951,"volume_24h":30322676318.63,"percent_change_1h":-0.159917099159,"percent_change_24h":3.621580803777,"percent_change_7d":5.987281074996,"market_cap":973354882472.7817,"last_updated":"2021-09-05T23:00:00.000Z"}},"rank":1,"noLazyLoad":true},
Is there a simple solution for this?
CodePudding user response:
This applies only to the listing table, which is fully loaded on the page.
https://coinmarketcap.com/historical/20210905/
The path segment 20210905 is the date 2021-09-05 written as YYYYMMDD. Replace it with the desired date, for example https://coinmarketcap.com/historical/20210101/, and the page displays the data for that day. Then scrape the page and extract the JSON data.
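The date substitution can be wrapped in a small helper. This is only a sketch: `historical_url` and `BASE_URL` are made-up names, and it assumes the URL pattern stays exactly as in the two links above.

```python
from datetime import date

# Assumed from the two example links above; not an official endpoint.
BASE_URL = "https://coinmarketcap.com/historical/"


def historical_url(day: date) -> str:
    """Build the historical snapshot URL for a given day (YYYYMMDD in the path)."""
    return f"{BASE_URL}{day:%Y%m%d}/"


print(historical_url(date(2021, 9, 5)))
# https://coinmarketcap.com/historical/20210905/
```

The returned string can be passed straight to requests.get().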
CodePudding user response:
You can try something like this:
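import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://coinmarketcap.com/historical/20210905/')
soup = BeautifulSoup(r.text, 'lxml')
# The tag is <script id="__NEXT_DATA__" type="application/json">, so match on that type
data = json.loads(soup.find('script', type='application/json', id='__NEXT_DATA__').text)
# If 'listingHistorical' is not a top-level key, walk down to wherever it is nested
historical_data = data['listingHistorical']['data']
print(historical_data)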
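The question only shows a fragment of the JSON, so the exact path down to "listingHistorical" is not known here. A recursive lookup avoids hard-coding it. The following is a rough sketch, not a definitive implementation: `find_key` and `coin_snapshot` are hypothetical helper names, and the "props" wrapper in the sample is invented purely to show nesting. The sample values are copied from the fragment in the question.

```python
import json


def find_key(node, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def coin_snapshot(next_data, symbol="BTC", currency="USD"):
    """Return market cap, price and circulating supply for one coin."""
    listing = find_key(next_data, "listingHistorical")
    if listing is None:
        raise KeyError("listingHistorical not found in the parsed JSON")
    for coin in listing["data"]:
        if coin["symbol"] == symbol:
            quote = coin["quote"][currency]
            return {
                "market_cap": quote["market_cap"],
                "price": quote["price"],
                "circulating_supply": coin["circulating_supply"],
            }
    raise LookupError(f"{symbol} not present in the listing")


# Trimmed sample; the outer "props" level is an assumption for illustration.
sample = json.loads(
    '{"props": {"listingHistorical": {"data": [{"id": 1, "name": "Bitcoin", '
    '"symbol": "BTC", "circulating_supply": 18807550, "quote": {"USD": '
    '{"price": 51753.41192620951, "market_cap": 973354882472.7817}}}]}}}'
)

print(coin_snapshot(sample))
```

With the real page, pass the dictionary returned by json.loads() on the __NEXT_DATA__ script contents instead of `sample`.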