For a school project, I'm trying to scrape data from the following website: https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30. My aim is to get separate lists for the following columns: close, open, volume and date. My problem is that the cells for volume, open and close all share the same class name (text-center). For example, the first row looks like this:
<th scope="row" class="font-semibold text-center">2021-09-30</th>
<td class="text-center">
$782,626,384,092
</td>
<td class="text-center">
$30,068,690,312
</td>
<td class="text-center">
$41,588
</td>
<td class="text-center">
N/A
</td>
I tried to solve it with the following code but wasn't successful (for the close values):
from bs4 import BeautifulSoup
import requests
import pandas as pd
website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find('table', {'class':'table-striped'}).find('tbody').find_all('tr')
close = []
volume = []
open = []
date = []
all_tr = soup.find_all('tr')
print('rows:', len(all_tr))
for row in all_tr:
    all_td = row.find_all('td', recursive=False)
    print('columns:', len(all_td))
    for column in all_td:
        print(column.text)
    close.append(all_td[4].text)
If somebody could help me out, I would be very grateful!
CodePudding user response:
Here is a solution using BeautifulSoup and CSS selectors.
from bs4 import BeautifulSoup
import requests
import pandas as pd
website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.select('table.table-striped tbody tr')
data = []
for result in results:
    # :nth-child counts the leading <th> (the date) as child 1,
    # so the four <td> cells are children 2 to 5
    close = result.select_one('td.text-center:nth-child(5)').get_text(strip=True)
    volume = result.select_one('td.text-center:nth-child(3)').get_text(strip=True)
    open_ = result.select_one('td.text-center:nth-child(4)').get_text(strip=True)  # avoid shadowing the built-in open()
    date = result.select_one('th[scope="row"]').get_text(strip=True)
    data.append([close, volume, open_, date])
cols = ["close", "volume", "open", "datum"]
df = pd.DataFrame(data, columns=cols)
print(df)
Output:
close volume open datum
0 N/A $30,068,690,312 $41,588 2021-09-30
1 $41,588 $29,691,944,223 $41,010 2021-09-29
2 $41,010 $30,483,144,439 $42,247 2021-09-28
3 $42,247 $30,462,815,705 $43,337 2021-09-27
4 $43,337 $30,898,116,660 $42,857 2021-09-26
.. ... ... ... ...
268 $34,082 $74,657,165,356 $31,516 2021-01-05
269 $31,516 $178,894,068,361 $33,008 2021-01-04
270 $33,008 $57,273,436,641 $32,164 2021-01-03
271 $32,164 $34,089,717,988 $29,352 2021-01-02
272 $29,352 $43,503,516,563 $29,022 2021-01-01
[273 rows x 4 columns]
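To see why the selectors use positions 3 to 5, here is a minimal sketch that runs on a single row instead of the live page. The HTML is the row from the question, wrapped in a bare table by assumption; the real page may contain more markup. It also shows why `all_td[4]` in the question fails: the date sits in a `<th>`, so each row only has four `<td>` cells.

```python
from bs4 import BeautifulSoup

# One table row, copied from the question and wrapped in a minimal table (assumed layout).
html = """
<table class="table-striped"><tbody><tr>
<th scope="row" class="font-semibold text-center">2021-09-30</th>
<td class="text-center">$782,626,384,092</td>
<td class="text-center">$30,068,690,312</td>
<td class="text-center">$41,588</td>
<td class="text-center">N/A</td>
</tr></tbody></table>
"""

row = BeautifulSoup(html, "html.parser").select_one("table.table-striped tbody tr")

# find_all('td') skips the <th>, so only four cells come back (valid indexes 0 to 3).
tds = row.find_all("td", recursive=False)
n_td = len(tds)

# :nth-child counts every child element of the <tr>, including the <th>.
date = row.select_one('th[scope="row"]').get_text(strip=True)
volume = row.select_one("td.text-center:nth-child(3)").get_text(strip=True)
open_price = row.select_one("td.text-center:nth-child(4)").get_text(strip=True)
close = row.select_one("td.text-center:nth-child(5)").get_text(strip=True)

print(n_td, date, volume, open_price, close)
```

With plain indexing, the same close cell would be `tds[3]` (or `tds[-1]`), not `tds[4]`.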
CodePudding user response:
You can do that using pandas as follows:
Code:
from io import StringIO
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
url = "https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel"
req = requests.get(url, headers=headers)
# Recent pandas versions expect a file-like object here rather than a raw HTML string
table = pd.read_html(StringIO(req.text), attrs={"class": "table-striped"})
df = table[0]
print(df)
Output:
Date Market Cap Volume Open Close
0 2021-09-30 $782,626,384,092 $30,068,690,312 $41,588 NaN
1 2021-09-29 $775,534,111,089 $29,691,944,223 $41,010 $41,588
2 2021-09-28 $794,889,951,096 $30,483,144,439 $42,247 $41,010
3 2021-09-27 $825,341,135,636 $30,462,815,705 $43,337 $42,247
4 2021-09-26 $808,279,417,023 $30,898,116,660 $42,857 $43,337
.. ... ... ... ... ...
268 2021-01-05 $585,726,270,249 $74,657,165,356 $31,516 $34,082
269 2021-01-04 $613,616,917,626 $178,894,068,361 $33,008 $31,516
270 2021-01-03 $597,887,713,054 $57,273,436,641 $32,164 $33,008
271 2021-01-02 $545,593,282,215 $34,089,717,988 $29,352 $32,164
272 2021-01-01 $539,438,036,436 $43,503,516,563 $29,022 $29,352
[273 rows x 5 columns]
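Since the question asks for separate lists, here is a minimal sketch of one way to pull them out of such a DataFrame. The sample frame is built by hand from the first three rows of the output above, so it runs without a network request; `to_number` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

# First three rows copied from the output above; the real frame comes from read_html.
df = pd.DataFrame({
    "Date": ["2021-09-30", "2021-09-29", "2021-09-28"],
    "Volume": ["$30,068,690,312", "$29,691,944,223", "$30,483,144,439"],
    "Open": ["$41,588", "$41,010", "$42,247"],
    "Close": [None, "$41,588", "$41,010"],
})

def to_number(series):
    """Hypothetical helper: strip '$' and ',' and convert to numbers (missing values become NaN)."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# One plain Python list per column.
date = df["Date"].tolist()
volume = to_number(df["Volume"]).tolist()
open_ = to_number(df["Open"]).tolist()
close = to_number(df["Close"]).tolist()

print(date, volume, open_, close, sep="\n")
```

If you would rather keep the original strings such as `$41,588`, call `.tolist()` on the column directly and skip the conversion.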
CodePudding user response:
Here's an answer using defaultdict. Not sure if you've covered collections or not:
from collections import defaultdict
from bs4 import BeautifulSoup
import requests
website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'class':'table-striped'})
columns = [th.text for th in table.find('thead').find_all('th')]
rows = table.find('tbody').find_all('tr')
data = defaultdict(list)
# Each row keeps its date in a <th> and the other values in <td>,
# so collect both to stay aligned with the header names
for row in rows:
    for i, col in enumerate(row.find_all(['th', 'td'])):
        data[columns[i]].append(col.text.strip())
print(data.keys())
print(data['Date'][:5])
With the date cell included, this should print:
dict_keys(['Date', 'Market Cap', 'Volume', 'Open', 'Close'])
['2021-09-30', '2021-09-29', '2021-09-28', '2021-09-27', '2021-09-26']
Even if you don't use defaultdict, you can see from this that you need two loops: one to iterate through the rows, then an inner loop to iterate through each column within the row. If you want to keep it really basic, you can eliminate the inner loop with something like:
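date, cap = [], []
for row in rows:
    # The date is in a <th>, the remaining values are in <td> cells
    cols = row.find_all(['th', 'td'])
    date.append(cols[0].text.strip())
    cap.append(cols[1].text.strip())
    ...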
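To make the two-loop idea concrete without hitting the website, here is a minimal sketch that works on cell texts that have already been extracted. The two sample rows are copied from the outputs earlier in this thread, and the column names are assumed to match the table header; in the real script both would come from the parsed table.

```python
from collections import defaultdict

# Assumed header names and two rows of cell text, copied from the outputs above.
columns = ["Date", "Market Cap", "Volume", "Open", "Close"]
rows = [
    ["2021-09-30", "$782,626,384,092", "$30,068,690,312", "$41,588", "N/A"],
    ["2021-09-29", "$775,534,111,089", "$29,691,944,223", "$41,010", "$41,588"],
]

# Two explicit loops: the outer one over rows, the inner one over the cells of a row.
data = defaultdict(list)
for row in rows:
    for name, cell in zip(columns, row):
        data[name].append(cell)

# Same result without the inner loop: zip(*rows) turns rows into columns.
transposed = {name: list(values) for name, values in zip(columns, zip(*rows))}

date, close = data["Date"], data["Close"]
print(date)
print(close)
```

Both approaches give one list per column; which one to use is mostly a matter of what you have covered in class so far.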