Scrape web with info from several years and create a csv file for each year

Time:08-10

I have scraped information with the results of the 2016 Chess Olympiad, using the following code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)

#Gets the column headers of the table (only the first eleven columns on the page)
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)

#We take the data from the fourth row on; the earlier rows belong to the headers.

for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row

I want to do the same thing for the 2014 and 2012 results (the Olympiad is normally played every two years), automatically. I have gotten about halfway with the code, but I really don't know how to continue. This is what I've done so far.

import requests
from bs4 import BeautifulSoup
import pandas as pd

#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)

#Gets the column headers of the table
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)

#We take the data from the fourth row on; the earlier rows belong to the headers.

start_year=2012
i=2
end_year=2016 

def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row

while start_year<end_year:
    download_chess(start_year)
    start_year =i    

download_chess(start_year)

I don't have much experience, so I don't quite understand the logic of building the filenames. I hope you can help me.

CodePudding user response:

The following will retrieve the standings for a range of years -- in this case 2000 through 2018 -- and save each year's table to its own CSV file:

import pandas as pd

#Even years only: the Olympiad is normally held every two years
years = range(2000, 2019, 2)

for y in years:
    try:
        #read_html returns every table on the page; the standings are the second one
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]  #the real column names sit in the third row
        df = df[3:]              #drop the header rows from the data
        df.columns = new_header
        print(df)
        #The year goes into the filename the same way it went into the URL
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
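Your own BeautifulSoup version can be finished the same way: parameterise the function by year, parse the `response` you just fetched (not the earlier `page`), re-find the table inside the function, and build the CSV name from the year with the same f-string pattern used for the URL. A sketch, assuming every year's page uses the same `border="1"` table layout as the 2016 one:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def download_chess(year):
    #The year is formatted into both the URL and the output filename
    url = f'https://www.olimpbase.org/{year}/{year}te14.html'
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table', attrs={'border': '1'})

    #First eleven column headers, as in your 2016 script
    headers = [td.text.strip() for td in table.find_all('td', class_='bog')[1:12]]
    df = pd.DataFrame(columns=headers)

    #Data starts at the fourth row; skip rows that are too short to be data
    for tr in table.find_all('tr')[3:]:
        row = [td.text for td in tr.find_all('td')][:11]
        if len(row) == len(headers):
            df.loc[len(df)] = row

    df.to_csv(f'chess_olympics_{year}.csv', index=False)

for year in range(2012, 2017, 2):  #2012, 2014, 2016
    download_chess(year)
```

Because the filename is built from the loop variable, each year overwrites nothing and you end up with one CSV per Olympiad.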

This will print out the results table for each year:

no. team Elo flag code pos. pts Buch MP gms nan = - nan = - nan % Eloav Elop ind.medals
3 1 Russia 2685 nan RUS 1 38 457.5 20 56 nan 8 4 2 nan 23 30 3 nan 67.9 2561 2694 1 - 0 - 2
4 2 Germany 2604 nan GER 2 37 455.5 22 56 nan 10 2 2 nan 21 32 3 nan 66.1 2568 2685 0 - 0 - 2
5 3 Ukraine 2638 nan UKR 3 35½ 457.5 21 56 nan 8 5 1 nan 18 35 3 nan 63.4 2558 2653 1 - 0 - 0
6 4 Hungary 2661 nan HUN 4 35½ 455.5 21 56 nan 8 5 1 nan 22 27 7 nan 63.4 2570 2665 0 - 0 - 0
7 5 Israel 2652 nan ISR 5 34½ 463.5 20 56 nan 7 6 1 nan 17 35 4 nan 61.6 2562 2649 0 - 0 - 0

[...]
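Note the stray `nan` columns in the output above: spacer cells in the HTML come through as columns that are NaN in every row. If they bother you, they can be dropped before saving -- a minimal sketch on a stand-in frame (the real one comes from `read_html`):

```python
import numpy as np
import pandas as pd

#Stand-in for one scraped row; the 'flag' column mimics an HTML spacer cell
df = pd.DataFrame({'no.': [1], 'team': ['Russia'], 'flag': [np.nan], 'Elo': [2685]})

#Columns that are NaN in every row are layout artifacts; drop them
df = df.dropna(axis=1, how='all')
print(list(df.columns))  #['no.', 'team', 'Elo']
```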

Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
