I am working on an independent project where I want to scrape all historical data for a cryptocurrency and store it in a Python pandas DataFrame. I have identified the structure of the HTML page and have the following code:
from bs4 import BeautifulSoup
import urllib3
import requests
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
bitcoin_df = pd.DataFrame(columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap'])
bitcoin_url = "https://coinmarketcap.com/currencies/bitcoin/historical-data/"
bitcoin_content = requests.get(bitcoin_url).text
bitcoin_soup = BeautifulSoup(bitcoin_content, "lxml")
#print(bitcoin_soup.prettify())
bitcoin_table = bitcoin_soup.find("table", attrs={"class": "h7vnx2-2 hLKazY cmc-table "})
bitcoin_table_data = bitcoin_table.find_all("tr")
for tr in bitcoin_table_data:
    tds = tr.find_all("td")
    # Index the list of cells (tds), not a single cell; header rows contain no <td>
    if len(tds) == 7:
        bitcoin_df.loc[len(bitcoin_df)] = [td.text for td in tds]
However, I encounter this error:
>AttributeError Traceback (most recent call last)
<ipython-input-46-316341b6771b> in <module>
7
8 bitcoin_table = bitcoin_soup.find("table", attrs={"class": "h7vnx2-2 hLKazY cmc-table "})
----> 9 bitcoin_table_data = bitcoin_table.find_all("tr")
10
11 #for tr in bitcoin_soup.find_all('tr'):
>AttributeError: 'NoneType' object has no attribute 'find_all'
CodePudding user response:
You are getting that error because the .find() call returned None, indicating that it could not locate the table. The table is built by JavaScript running in the browser, so it is not present in the HTML that requests downloads.
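As a minimal sketch of that failure mode, the snippet below parses a small hand-written HTML string (a hypothetical stand-in for what requests receives, not the real page) and checks the result of .find() before using it:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded HTML: the page shell is present,
# but there is no <table> because no JavaScript has run.
static_html = "<html><body><div id='app'></div></body></html>"

soup = BeautifulSoup(static_html, "html.parser")
table = soup.find("table")  # returns None when nothing matches

if table is None:
    # Calling table.find_all(...) here would raise the AttributeError above
    rows = []
    print("No <table> found in the downloaded HTML")
else:
    rows = table.find_all("tr")
```

Checking for None this way does not make the table appear; it only turns the AttributeError into an explicit message.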
Rather than trying to parse the HTML, you could just request the data directly from their API (as the browser does). For example:
import pandas as pd
import requests
import time
# Request the last 61 days (5270400 seconds) up to the current time
ts = int(time.time())
json_url = f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart={ts - 5270400}&timeEnd={ts}"
json_req = requests.get(json_url)
json_data = json_req.json()
data = []
for quote in json_data['data']['quotes']:
    data.append([
        quote['quote']['timestamp'],
        quote['quote']['open'],
        quote['quote']['high'],
        quote['quote']['low'],
        quote['quote']['close'],
        quote['quote']['volume'],
        quote['quote']['marketCap'],
    ])
df = pd.DataFrame(data, columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap'])
print(df)
Which would give you a dataframe starting:
                       Date          Open          High           Low         Close        Volume    Market Cap
0  2021-09-13T23:59:59.999Z  46057.215327  46598.678985  43591.320785  44963.072633  4.096994e+10  8.459805e+11
1  2021-09-14T23:59:59.999Z  44960.049359  47218.125355  44752.331349  47092.493833  3.865215e+10  8.860953e+11
2  2021-09-15T23:59:59.999Z  47097.998123  48450.468466  46773.326543  48176.346393  3.048450e+10  9.065325e+11
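The Date column above holds plain strings. If you want a time-indexed dataframe, one possible sketch is shown below. It assumes the quotes list has the shape used in the loop above; sample_quotes is a hypothetical stand-in whose numbers are copied from the first two rows of that output, and quotes_to_frame is an illustrative helper name, not part of any library.

```python
import pandas as pd

# Hypothetical sample shaped like json_data['data']['quotes'];
# values are copied from the first two rows printed above.
sample_quotes = [
    {"quote": {"timestamp": "2021-09-13T23:59:59.999Z", "open": 46057.215327,
               "high": 46598.678985, "low": 43591.320785, "close": 44963.072633,
               "volume": 4.096994e10, "marketCap": 8.459805e11}},
    {"quote": {"timestamp": "2021-09-14T23:59:59.999Z", "open": 44960.049359,
               "high": 47218.125355, "low": 44752.331349, "close": 47092.493833,
               "volume": 3.865215e10, "marketCap": 8.860953e11}},
]

# Map the JSON keys to the column names used in the dataframe above
COLUMN_NAMES = {
    "timestamp": "Date", "open": "Open", "high": "High", "low": "Low",
    "close": "Close", "volume": "Volume", "marketCap": "Market Cap",
}

def quotes_to_frame(quotes):
    """Build a dataframe indexed by parsed UTC timestamps."""
    frame = pd.DataFrame([q["quote"] for q in quotes])
    # Keep only the known keys, in a fixed order, then rename them
    frame = frame[list(COLUMN_NAMES)].rename(columns=COLUMN_NAMES)
    frame["Date"] = pd.to_datetime(frame["Date"], utc=True)
    return frame.set_index("Date").sort_index()

prices = quotes_to_frame(sample_quotes)
print(prices["Close"])
```

With the live response you would pass json_data['data']['quotes'] instead of sample_quotes.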
This URL was found by watching the requests the browser makes for the data, using the browser's own developer tools. I suggest you print(json_data) to see what was returned.
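If you want to vary the time window, the URL can be assembled from its parts. The helper below is only a sketch: build_history_url and its parameter names are hypothetical, the query parameters are the ones that appear in the URL above, and it assumes the endpoint keeps accepting them in that form.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical"

def build_history_url(coin_id=1, convert_id=2781, days=61, now=None):
    """Return the request URL for the last `days` days, ending at `now`."""
    end = int(time.time()) if now is None else int(now)
    start = end - days * 86400  # 86400 seconds per day
    params = {
        "id": coin_id,
        "convertId": convert_id,
        "timeStart": start,
        "timeEnd": end,
    }
    return f"{BASE_URL}?{urlencode(params)}"

# The defaults reproduce the 61-day window (5270400 seconds) used above
print(build_history_url())
```

The optional now argument exists only so the window can be pinned to a fixed timestamp, for example when re-running an earlier request.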