I'm new to scraping and would like to scrape the "Historical Data" table from this URL: https://coinmarketcap.com/currencies/bitcoin/historical-data/
I have tried to use bs4, but nothing seems to work for me: it just returns an empty list. As far as I understand, what I need to do is find all "tr" elements in the container, is that right? I don't have much code, but I think it makes sense to show it so there is something to work with:
My code:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://coinmarketcap.com/currencies/bitcoin/historical-data/")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('tr')  # returns an empty list
CodePudding user response:
The data you are looking for is not in the initial HTML; it is added to the page via an XHR/Fetch call, which is why bs4 finds no rows. You can request that endpoint directly, like below:
import requests
r = requests.get('https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart=1633910400&timeEnd=1639180800')
if r.status_code == 200:
    print(r.json())
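The timeStart and timeEnd values in that URL are Unix timestamps in seconds, so the date range can be built from calendar dates instead of being hard-coded. Below is a minimal sketch using only the standard library. The helper names (to_unix, historical_url) are hypothetical, and reading id=1 as Bitcoin and convertId=2781 as USD is an assumption based on the URL above, not something the endpoint documents here.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Endpoint taken from the answer above; it may change without notice.
BASE_URL = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical"


def to_unix(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def historical_url(coin_id, convert_id, start, end):
    """Build the request URL; `start` and `end` are Unix timestamps in seconds."""
    query = urlencode({
        "id": coin_id,            # assumed: 1 is Bitcoin
        "convertId": convert_id,  # assumed: 2781 is USD
        "timeStart": start,
        "timeEnd": end,
    })
    return f"{BASE_URL}?{query}"


# Same range as in the answer above: 2021-10-11 to 2021-12-11 (UTC).
url = historical_url(1, 2781, to_unix(2021, 10, 11), to_unix(2021, 12, 11))
print(url)
```

The resulting string can be passed to requests.get() exactly as in the snippet above.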
CodePudding user response:
Expanding on @balderman's answer, you can try this to get it into a pandas DataFrame:
import pandas as pd
import requests

url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart=1633910400&timeEnd=1639180800'
output = pd.DataFrame(requests.get(url).json()['data']['quotes'])
This returns:
timeOpen ... quote
0 2021-10-11T00:00:00.000Z ... {'open': 54734.124840616, 'high': 57793.039249...
1 2021-10-12T00:00:00.000Z ... {'open': 57526.8320114193, 'high': 57627.87860...
2 2021-10-13T00:00:00.000Z ... {'open': 56038.2567881108, 'high': 57688.66010...
3 2021-10-14T00:00:00.000Z ... {'open': 57372.8320788954, 'high': 58478.73549...
4 2021-10-15T00:00:00.000Z ... {'open': 57345.9019791856, 'high': 62757.12970...
.. ... ... ...
56 2021-12-06T00:00:00.000Z ... {'open': 49413.4790992129, 'high': 50929.51909...
57 2021-12-07T00:00:00.000Z ... {'open': 50581.8300495181, 'high': 51934.78189...
58 2021-12-08T00:00:00.000Z ... {'open': 50667.6476830609, 'high': 51171.37531...
59 2021-12-09T00:00:00.000Z ... {'open': 50450.0820524109, 'high': 50797.16544...
60 2021-12-10T00:00:00.000Z ... {'open': 47642.1435531841, 'high': 50015.25298...
Finally, using a join() operation, we can unnest the quote column, which contains a dict with the values:
output = output.join(pd.concat([pd.DataFrame([x]) for x in output['quote']]).reset_index(drop=True)).drop(columns='quote')
To obtain it in a nice and clear format:
timeOpen timeClose timeHigh timeLow open high low close volume marketCap timestamp
0 2021-10-11T00:00:00.000Z 2021-10-11T23:59:59.999Z 2021-10-11T19:47:02.000Z 2021-10-11T00:04:02.000Z 54734.124841 57793.039249 54519.765520 57484.789465 4.263733e+10 1.083079e+12 2021-10-11T23:59:59.999Z
1 2021-10-12T00:00:00.000Z 2021-10-12T23:59:59.999Z 2021-10-12T06:14:02.000Z 2021-10-12T20:09:02.000Z 57526.832011 57627.878602 54477.974468 56041.056838 4.108376e+10 1.055926e+12 2021-10-12T23:59:59.999Z
2 2021-10-13T00:00:00.000Z 2021-10-13T23:59:59.999Z 2021-10-13T21:43:02.000Z 2021-10-13T09:10:02.000Z 56038.256788 57688.660104 54370.973228 57401.097527 4.168425e+10 1.081612e+12 2021-10-13T23:59:59.999Z
3 2021-10-14T00:00:00.000Z 2021-10-14T23:59:59.999Z 2021-10-14T02:27:02.000Z 2021-10-14T18:30:02.000Z 57372.832079 58478.735499 56957.076136 57321.525280 3.661579e+10 1.080160e+12 2021-10-14T23:59:59.999Z
4 2021-10-15T00:00:00.000Z 2021-10-15T23:59:59.999Z 2021-10-15T20:28:02.000Z 2021-10-15T01:20:02.000Z 57345.901979 62757.129703 56868.142693 61593.950061 5.178008e+10 1.160726e+12 2021-10-15T23:59:59.999Z
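As a possible alternative to the join() and concat() step, pandas.json_normalize can flatten the nested quote dict in one call. This is only a sketch: the two sample records below are shaped like the entries of ['data']['quotes'], with values abbreviated from the output above and most fields left out for brevity.

```python
import pandas as pd

# Sample records shaped like the entries of ['data']['quotes'];
# values are abbreviated from the output above, other fields are omitted.
quotes = [
    {"timeOpen": "2021-10-11T00:00:00.000Z",
     "quote": {"open": 54734.124841, "high": 57793.039249, "close": 57484.789465}},
    {"timeOpen": "2021-10-12T00:00:00.000Z",
     "quote": {"open": 57526.832011, "high": 57627.878602, "close": 56041.056838}},
]

# json_normalize expands nested dicts into columns named "quote.open", etc.
flat = pd.json_normalize(quotes)

# Drop the "quote." prefix so the columns match the flat layout above.
flat = flat.rename(columns=lambda name: name.replace("quote.", "", 1))
print(flat)
```

With the real response, the list passed to json_normalize would be the ['data']['quotes'] part of the JSON rather than the hand-written sample.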