Scraping data from a container

Time:12-11

I'm new to scraping and would like to scrape the "Historical Data" table from this URL: https://coinmarketcap.com/currencies/bitcoin/historical-data/

I have tried to use bs4, but nothing seems to work; it just returns an empty list. As far as I understand, I need to find all "tr" elements in the container, is that right? I don't have much code, but I think it makes sense to show it so there is something to work with:

My code:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://coinmarketcap.com/currencies/bitcoin/historical-data/")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('tr')

CodePudding user response:

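The data you are looking for is loaded into the page by an XHR/Fetch call after the initial HTML arrives, so it is not in the markup that requests downloads. You can call that endpoint directly, like this: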

import requests

r = requests.get('https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart=1633910400&timeEnd=1639180800')
if r.status_code == 200:
  print(r.json())
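The timeStart and timeEnd values in that URL decode to midnight UTC on 2021-10-11 and 2021-12-11, so they look like Unix timestamps in seconds. Below is a minimal sketch for building the same query for another date range. The helper names to_unix and build_params are hypothetical, and the meaning of id=1 and convertId=2781 (presumably Bitcoin and USD) is an assumption read off the URL, not something taken from official documentation.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical"


def to_unix(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_params(start, end, coin_id=1, convert_id=2781):
    """Build the query parameters; start and end are (year, month, day) tuples.

    The defaults are copied from the URL above; their meaning is assumed.
    """
    return {
        "id": coin_id,
        "convertId": convert_id,
        "timeStart": to_unix(*start),
        "timeEnd": to_unix(*end),
    }


params = build_params((2021, 10, 11), (2021, 12, 11))
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

The resulting url can be passed to requests.get(url), or the dict can be passed as requests.get(BASE_URL, params=params), which encodes the query string for you.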

CodePudding user response:

Expanding on @balderman's answer, you can try this to get the data into a pandas DataFrame:

import pandas as pd
output = pd.DataFrame(requests.get('https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart=1633910400&timeEnd=1639180800').json()['data']['quotes'])

Returning

                    timeOpen  ...                                              quote
0   2021-10-11T00:00:00.000Z  ...  {'open': 54734.124840616, 'high': 57793.039249...
1   2021-10-12T00:00:00.000Z  ...  {'open': 57526.8320114193, 'high': 57627.87860...
2   2021-10-13T00:00:00.000Z  ...  {'open': 56038.2567881108, 'high': 57688.66010...
3   2021-10-14T00:00:00.000Z  ...  {'open': 57372.8320788954, 'high': 58478.73549...
4   2021-10-15T00:00:00.000Z  ...  {'open': 57345.9019791856, 'high': 62757.12970...
..                       ...  ...                                                ...
56  2021-12-06T00:00:00.000Z  ...  {'open': 49413.4790992129, 'high': 50929.51909...
57  2021-12-07T00:00:00.000Z  ...  {'open': 50581.8300495181, 'high': 51934.78189...
58  2021-12-08T00:00:00.000Z  ...  {'open': 50667.6476830609, 'high': 51171.37531...
59  2021-12-09T00:00:00.000Z  ...  {'open': 50450.0820524109, 'high': 50797.16544...
60  2021-12-10T00:00:00.000Z  ...  {'open': 47642.1435531841, 'high': 50015.25298...

Finally, using a join() operation, we can unnest the quote column, which contains a dict with the values:

output = output.join(pd.concat([pd.DataFrame([x]) for x in output['quote']]).reset_index(drop=True)).drop(columns='quote')
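As an alternative to the join above, the nested dict can be flattened before the DataFrame is built. This is only a sketch: flatten_quote is a hypothetical helper, and the sample records are trimmed to a few keys, with values copied from the rounded output shown in this answer rather than fetched live.

```python
def flatten_quote(record):
    """Return a copy of record with the nested 'quote' dict merged into the top level."""
    flat = {key: value for key, value in record.items() if key != "quote"}
    flat.update(record.get("quote", {}))
    return flat


# Trimmed records shaped like the items in ['data']['quotes'] (assumed shape).
sample = [
    {
        "timeOpen": "2021-10-11T00:00:00.000Z",
        "quote": {"open": 54734.124841, "high": 57793.039249,
                  "low": 54519.765520, "close": 57484.789465},
    },
    {
        "timeOpen": "2021-10-12T00:00:00.000Z",
        "quote": {"open": 57526.832011, "high": 57627.878602,
                  "low": 54477.974468, "close": 56041.056838},
    },
]

rows = [flatten_quote(record) for record in sample]
print(rows[0])
```

Passing rows to pd.DataFrame(rows) then yields one column per key, with no join or concat needed. pandas also provides pd.json_normalize, which flattens nested dicts directly but prefixes the nested columns (for example quote.open).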

To obtain it in a nice and clear format:

                    timeOpen                   timeClose                   timeHigh                      timeLow            open            high            low            close          volume       marketCap    timestamp
0   2021-10-11T00:00:00.000Z    2021-10-11T23:59:59.999Z    2021-10-11T19:47:02.000Z    2021-10-11T00:04:02.000Z    54734.124841    57793.039249    54519.765520    57484.789465    4.263733e+10    1.083079e+12    2021-10-11T23:59:59.999Z
1   2021-10-12T00:00:00.000Z    2021-10-12T23:59:59.999Z    2021-10-12T06:14:02.000Z    2021-10-12T20:09:02.000Z    57526.832011    57627.878602    54477.974468    56041.056838    4.108376e+10    1.055926e+12    2021-10-12T23:59:59.999Z
2   2021-10-13T00:00:00.000Z    2021-10-13T23:59:59.999Z    2021-10-13T21:43:02.000Z    2021-10-13T09:10:02.000Z    56038.256788    57688.660104    54370.973228    57401.097527    4.168425e+10    1.081612e+12    2021-10-13T23:59:59.999Z
3   2021-10-14T00:00:00.000Z    2021-10-14T23:59:59.999Z    2021-10-14T02:27:02.000Z    2021-10-14T18:30:02.000Z    57372.832079    58478.735499    56957.076136    57321.525280    3.661579e+10    1.080160e+12    2021-10-14T23:59:59.999Z
4   2021-10-15T00:00:00.000Z    2021-10-15T23:59:59.999Z    2021-10-15T20:28:02.000Z    2021-10-15T01:20:02.000Z    57345.901979    62757.129703    56868.142693    61593.950061    5.178008e+10    1.160726e+12    2021-10-15T23:59:59.999Z