I have scraped information with the results of the 2016 Chess Olympiad, using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets all the column headers of our table, but just for the first eleven columns in the webpage
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)

# We grab data from the fourth row onwards; the previous rows belong to the headers.
for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row
I want to do the same thing for the results of 2014 and 2012 (the Olympiad is normally played every two years), automatically. I have gotten the code about halfway there, but I really don't know how to continue. This is what I've done so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets all the column headers of our table
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)

start_year = 2012
i = 2
end_year = 2016

def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    # We grab data from the fourth row onwards; the previous rows belong to the headers.
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row

while start_year < end_year:
    download_chess(start_year)
    start_year = i
    download_chess(start_year)
I don't have much experience, so I don't quite understand the logic of writing the filenames. I hope you can help me.
CodePudding user response:
The following will retrieve the results for a range of years (in this case 2000 to 2018) and save each table to CSV as well:
import pandas as pd

years = range(2000, 2019, 2)  # every second year: 2000, 2002, ..., 2018

for y in years:
    try:
        # read_html fetches the page and parses every table on it;
        # the results table is the second one
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]  # the third row holds the real column names
        df = df[3:]              # drop the header rows
        df.columns = new_header
        print(df)
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
This will print out the results table for each year:
|   | no. | team    | Elo  | flag | code | pos. | pts | Buch  | MP | gms | +  | = | - | +  | =  | - | %    | Eloav | Elop | ind.medals |
|---|-----|---------|------|------|------|------|-----|-------|----|-----|----|---|---|----|----|---|------|-------|------|------------|
| 3 | 1   | Russia  | 2685 |      | RUS  | 1    | 38  | 457.5 | 20 | 56  | 8  | 4 | 2 | 23 | 30 | 3 | 67.9 | 2561  | 2694 | 1 - 0 - 2  |
| 4 | 2   | Germany | 2604 |      | GER  | 2    | 37  | 455.5 | 22 | 56  | 10 | 2 | 2 | 21 | 32 | 3 | 66.1 | 2568  | 2685 | 0 - 0 - 2  |
| 5 | 3   | Ukraine | 2638 |      | UKR  | 3    | 35½ | 457.5 | 21 | 56  | 8  | 5 | 1 | 18 | 35 | 3 | 63.4 | 2558  | 2653 | 1 - 0 - 0  |
| 6 | 4   | Hungary | 2661 |      | HUN  | 4    | 35½ | 455.5 | 21 | 56  | 8  | 5 | 1 | 22 | 27 | 7 | 63.4 | 2570  | 2665 | 0 - 0 - 0  |
| 7 | 5   | Israel  | 2652 |      | ISR  | 5    | 34½ | 463.5 | 20 | 56  | 7  | 6 | 1 | 17 | 35 | 4 | 61.6 | 2562  | 2649 | 0 - 0 - 0  |
[...]
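On the filename logic the question asks about: the f-string substitutes the loop variable into both the URL and the output filename, so each year automatically gets its own file and no manual bookkeeping is needed. A minimal sketch (the `year_filename` helper is just for illustration, not part of the code above):

```python
def year_filename(y):
    # Same naming pattern the loop above uses when saving each table
    return f'chess_olympics_{y}.csv'

# range(2012, 2017, 2) steps by two years: 2012, 2014, 2016
for y in range(2012, 2017, 2):
    print(f'https://www.olimpbase.org/{y}/{y}te14.html', '->', year_filename(y))
```

Because both the URL and the CSV name are built from the same `y`, the saved file always matches the year it was downloaded from.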
Relevant pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
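If you later want all the years in a single DataFrame, one way is to read the per-year CSVs back and stack them with `pd.concat`. A sketch, assuming the `chess_olympics_<year>.csv` naming used above (`combine_olympiads` is a hypothetical helper, not part of the answer's code):

```python
import glob
import pandas as pd

def combine_olympiads(pattern='chess_olympics_*.csv'):
    # Hypothetical helper: read every per-year CSV and stack them,
    # tagging each row with the year parsed from its filename
    # (chess_olympics_<year>.csv).
    frames = []
    for path in sorted(glob.glob(pattern)):
        year = int(path.rsplit('_', 1)[-1].split('.')[0])
        df = pd.read_csv(path)
        df['year'] = year
        frames.append(df)
    # ignore_index=True renumbers rows across all years
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

The `year` column then lets you group or filter the combined data by Olympiad, e.g. `combined[combined['year'] == 2016]`.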