Scraping with BeautifulSoup - problem with same class names

For a school project, I'm trying to scrape data from the following website: https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30. My aim is to get separate lists for the following columns: close, open, volume and date. My problem is that the cells for volume, open and close all share the same class name (text-center), as in this example for the first row:

<th scope="row" class="font-semibold text-center">2021-09-30</th>
<td class="text-center">
$782,626,384,092
</td>
<td class="text-center">
$30,068,690,312
</td>
<td class="text-center">
$41,588
</td>
<td class="text-center">
N/A
</td>

I tried to solve it with the following code but wasn't successful (for the close values):

from bs4 import BeautifulSoup
import requests
import pandas as pd

website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'

response = requests.get(website)

soup = BeautifulSoup(response.content, 'html.parser')

results = soup.find('table', {'class':'table-striped'}).find('tbody').find_all('tr')

close = []
volume = []
open = []
date = []

all_tr = soup.find_all('tr')
print('rows:', len(all_tr))

for row in all_tr:
    all_td = row.find_all('td', recursive=False)
    print('columns:', len(all_td))
    for column in all_td:
        print(column.text)

    close.append(all_td[4].text)

If somebody could help me out, I would be very grateful!

CodePudding user response：

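Here is a solution using BeautifulSoup and CSS selectors. Because the date <th> is the first child of each row, the volume, open and close cells are children 3, 4 and 5: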

from bs4 import BeautifulSoup
import requests
import pandas as pd

website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'

response = requests.get(website)

soup = BeautifulSoup(response.content, 'html.parser')

results = soup.select('table.table-striped tbody tr')

# close = []
# volume = []
# datum = []
# open = []
data=[]
for result in results:
    close = result.select_one('td.text-center:nth-child(5)').get_text(strip=True)
    volume = result.select_one('td.text-center:nth-child(3)').get_text(strip=True)
    open = result.select_one('td.text-center:nth-child(4)').get_text(strip=True)
    date = result.select_one('th[scope="row"]').get_text(strip=True)
    data.append([close,volume,open,date])


cols = ["close", "volume","open","datum"]

df = pd.DataFrame(data, columns= cols)
print(df)

Output:

     close            volume     open       datum
0        N/A   $30,068,690,312  $41,588  2021-09-30
1    $41,588   $29,691,944,223  $41,010  2021-09-29
2    $41,010   $30,483,144,439  $42,247  2021-09-28
3    $42,247   $30,462,815,705  $43,337  2021-09-27
4    $43,337   $30,898,116,660  $42,857  2021-09-26
..       ...               ...      ...         ...
268  $34,082   $74,657,165,356  $31,516  2021-01-05
269  $31,516  $178,894,068,361  $33,008  2021-01-04
270  $33,008   $57,273,436,641  $32,164  2021-01-03
271  $32,164   $34,089,717,988  $29,352  2021-01-02
272  $29,352   $43,503,516,563  $29,022  2021-01-01

[273 rows x 4 columns]
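
To see why nth-child(5) picks the close column, it may help to run the same selectors against the single row quoted in the question. This is only a minimal sketch that needs no network access: the row is copied from the question, the surrounding table skeleton is an assumption made for the example, and open_price is just a name chosen here to avoid shadowing Python's built-in open().

```python
from bs4 import BeautifulSoup

# One row copied from the question, wrapped in an assumed minimal table.
html = """
<table class="table-striped"><tbody><tr>
<th scope="row" class="font-semibold text-center">2021-09-30</th>
<td class="text-center">$782,626,384,092</td>
<td class="text-center">$30,068,690,312</td>
<td class="text-center">$41,588</td>
<td class="text-center">N/A</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one("table.table-striped tbody tr")

# nth-child counts every element child of the row, so the <th> is child 1
# and the four <td> cells are children 2 to 5.
date = row.select_one('th[scope="row"]').get_text(strip=True)
volume = row.select_one("td:nth-child(3)").get_text(strip=True)
open_price = row.select_one("td:nth-child(4)").get_text(strip=True)
close = row.select_one("td:nth-child(5)").get_text(strip=True)

print(date, volume, open_price, close)
```

The class name plays no part in telling the columns apart here; only the position of each cell within its row does.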

CodePudding user response：

You can do that using pandas as follows:

Code:

from io import StringIO

import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}

url = "https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel"

req = requests.get(url,headers=headers)

# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
table = pd.read_html(StringIO(req.text), attrs = {"class":"table-striped"} )

# read_html returns a list of DataFrames; the first one is the table we want
df = table[0]

print(df)

Output:

           Date        Market Cap            Volume     Open    Close
0    2021-09-30  $782,626,384,092   $30,068,690,312  $41,588      NaN
1    2021-09-29  $775,534,111,089   $29,691,944,223  $41,010  $41,588
2    2021-09-28  $794,889,951,096   $30,483,144,439  $42,247  $41,010
3    2021-09-27  $825,341,135,636   $30,462,815,705  $43,337  $42,247
4    2021-09-26  $808,279,417,023   $30,898,116,660  $42,857  $43,337
..          ...               ...               ...      ...      ...
268  2021-01-05  $585,726,270,249   $74,657,165,356  $31,516  $34,082
269  2021-01-04  $613,616,917,626  $178,894,068,361  $33,008  $31,516
270  2021-01-03  $597,887,713,054   $57,273,436,641  $32,164  $33,008
271  2021-01-02  $545,593,282,215   $34,089,717,988  $29,352  $32,164
272  2021-01-01  $539,438,036,436   $43,503,516,563  $29,022  $29,352

[273 rows x 5 columns]
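
The values in the output above are still strings such as $41,588. If numbers are needed, a small helper along these lines could be applied to each list or column. It is a sketch under assumptions: parse_usd is a hypothetical name, amounts are assumed to be whole dollars as in the output shown, and anything else (N/A, NaN) is mapped to None.

```python
def parse_usd(text):
    """Turn a string like '$41,588' into an int; return None for 'N/A' or non-strings."""
    if not isinstance(text, str):
        return None  # e.g. the NaN that pandas puts in place of a missing cell
    digits = text.strip().replace("$", "").replace(",", "")
    return int(digits) if digits.isdigit() else None


# Sample values copied from the output above.
close = ["N/A", "$41,588", "$41,010"]
volume = ["$30,068,690,312", "$29,691,944,223", "$30,483,144,439"]

print([parse_usd(v) for v in close])
print([parse_usd(v) for v in volume])
```

With a DataFrame, the same function could be passed to Series.map, for example df["Close"].map(parse_usd).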

CodePudding user response：

Here's an answer using defaultdict. Not sure if you've covered collections or not:

from collections import defaultdict
from bs4 import BeautifulSoup
import requests

website = 'https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30#panel'

response = requests.get(website)

soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'class':'table-striped'})
columns = [th.text for th in table.find('thead').find_all('th')]
rows = table.find('tbody').find_all('tr')

data = defaultdict(list)
# The date cell is a <th>, not a <td>, so collect both; otherwise every value
# lands one column to the left (market caps under 'Date', and so on).
for row in rows:
    for i, col in enumerate(row.find_all(['th', 'td'])):
        data[columns[i]].append(col.text.strip())

print(data.keys())
print(data['Date'][:5])

This prints:

dict_keys(['Date', 'Market Cap', 'Volume', 'Open', 'Close'])
['2021-09-30', '2021-09-29', '2021-09-28', '2021-09-27', '2021-09-26']

Even if you don't use defaultdict, this shows that you need two loops: an outer one over the rows and an inner one over the cells within each row. To keep it really basic, you could drop the inner loop and index the cells directly, with something like:

date, cap = [], []
for row in rows:
    cols = row.find_all(['th', 'td'])    # the date is a <th>, the other cells are <td>
    date.append(cols[0].text.strip())
    cap.append(cols[1].text.strip())
    ...
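
Once each row is available as a list of cell texts, the separate per-column lists the question asks for can also be produced in one step by transposing with zip. The following is a minimal sketch using plain lists; the sample rows are copied from the output shown earlier in this thread, and names such as open_price are chosen here only to avoid shadowing the built-in open().

```python
# One list of cell texts per table row: date, market cap, volume, open, close.
rows = [
    ["2021-09-30", "$782,626,384,092", "$30,068,690,312", "$41,588", "N/A"],
    ["2021-09-29", "$775,534,111,089", "$29,691,944,223", "$41,010", "$41,588"],
    ["2021-09-28", "$794,889,951,096", "$30,483,144,439", "$42,247", "$41,010"],
]

# zip(*rows) groups the first cell of every row, then the second, and so on,
# which turns a list of rows into one sequence per column.
date, cap, volume, open_price, close = (list(col) for col in zip(*rows))

print(date)
print(close)
```

This assumes every row has the same number of cells; zip silently truncates to the shortest row otherwise.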