Home > Enterprise >  how loop a scrap in different pages with Python
how loop a scrap in different pages with Python

Time:09-22

So, my objective with this code is extract informations about a rookie in my NFL team and compare his performance with players that finished last season in top 10 statistics in his position when they was in first season where they played more than 10 games.

For that i used as reference a post in Towards DS, where they explained how did scrap of nfl data.

So, my final code where i'm with problems is looking like this

#url page
url_mac = 'https://www.pro-football-reference.com/years/2021/passing.htm'

#opening URL with BS
html_mac = urlopen(url_mac)
stats_macpage = BeautifulSoup(html_mac)

#collecting table rows
column_headers = stats_macpage.findAll('tr')[0]
column_headers = [i.getText() for i in column_headers.findAll('th')]


#getting stats of each row
rows = stats_macpage.findAll('tr')[1:]
qb_stats = []
for i in range(len(rows)):
    qb_stats.append([col.getText() for col in rows[i].findAll('td')])
    
#creating a data frame
data = pd.DataFrame(qb_stats, columns = column_headers[1:])

#rename column of sack yards from yards to y_sack

new_columns = data.columns.values
new_columns[-6] = 'y_sack'
data.columns = new_columns

#selecting specifics stats
categories = ['Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate']

#first filter
data_radar = data[['Player', 'Tm']   categories]

#selecting specific player 
data_mac = data_radar[data_radar['Player'] == 'Mac Jones']

I did it for all 11 players that i want the data and i had concatenated in the final, but you can imagine how bad my code are looking.

How can i improve it to create a loop ? I already tried somethings but all of these ideas dont worked well or was above my ability to create a good code. Between my ideas was take all data of the last 20 years and try find year by year, but that look a bit unecessary, cause i already know what years i want. My problem are specifically in this part, cause in the final i could create a list and then a "if" and only take the first year of each player in my list played > 10 games.

url_mac = 'https://www.pro-football-reference.com/years/2021/passing.htm'

Thank you for all.

CodePudding user response:

To load all tables from years 2011-2021 to one dataframe, you can use this example:

import pandas as pd

# url page
url_mac = "https://www.pro-football-reference.com/years/{}/passing.htm"

all_dfs = []
for year in range(2011, 2022):
    print("Getting table for year {}".format(year))
    df = pd.read_html(url_mac.format(year))[0]
    df = df[~df["Rk"].eq("Rk")]
    df["Year"] = year
    all_dfs.append(df)

df = pd.concat(all_dfs)
print(df)
df.to_csv("data.csv", index=False)

Prints:

      Rk                 Player   Tm Age  Pos   G  GS   QBrec  Cmp  Att   Cmp%   Yds  TD    TD% Int   Int%   1D Lng   Y/A   AY/A   Y/C    Y/G   Rate    QBR  Sk Yds.1   NY/A   ANY/A   Sk%  4QC  GWD  Year
0      1            Drew Brees*  NOR  32   QB  16  16  13-3-0  468  657   71.2  5476  46    7.0  14    2.1  279  79   8.3    8.8  11.7  342.3  110.6   82.3  24   158   7.81    8.23   3.5    3    4  2011
1      2             Tom Brady*  NWE  34   QB  16  16  13-3-0  401  611   65.6  5235  39    6.4  12    2.0  262  99   8.6    9.0  13.1  327.2  105.6   73.8  32   173   7.87    8.25   5.0    1    2  2011
2      3       Matthew Stafford  DET  23   QB  16  16  10-6-0  421  663   63.5  5038  41    6.2  16    2.4  243  73   7.6    7.7  12.0  314.9   97.2   60.5  36   257   6.84    6.98   5.2    3    4  2011
3      4           Eli Manning*  NYG  30   QB  16  16   9-7-0  359  589   61.0  4933  29    4.9  16    2.7  218  99   8.4    8.1  13.7  308.3   92.9   64.2  28   199   7.67    7.45   4.5    5    6  2011


...

and saves data.csv (screenshot from LibreOffice):

enter image description here

CodePudding user response:

Go with pandas to parse those tables. Just loop through the seasons in the url (don't forget to add that in the dataframes otherwise you won;t know which rows are for what seasons:

import pandas as pd

# To get last 10 seasons
years_look_back = 10
years_list = list(range(2021,2021-(years_look_back 1),-1))
years_list.reverse()


# Or just make a list of specific years:
years_list = [2001, 2005, 2008, 2010, 2011, 2015]

cols = ['Player', 'Tm','Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate']
df_list = []
for year in years_list:
    url_mac = f'https://www.pro-football-reference.com/years/{year}/passing.htm'
    temp_df = pd.read_html(url_mac)[0][cols]
    temp_df['Season'] = year
    temp_df = temp_df[~temp_df["Player"].eq("Player")]
    df_list.append(temp_df)
    print(f'Collect: {year}')


data_radar = pd.concat(df_list)

#selecting specific player 
data_mac = data_radar[data_radar['Player'] == 'Mac Jones']
  • Related