How do you drop a header from a Pandas Dataframe formed by Scraping a Table using Beautifulsoup? (Py

Time:03-09

I scraped a table from pro-football-reference and built a DataFrame from it, but I'm running into an issue that seems to come from having to convert the HTML to a string before passing it to pandas.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table))[0]
print(rb_df.head().to_string())

Output:

  Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games     Rushing                                Unnamed: 14_level_0
                  Rk             Player                 Tm                Age                Pos     G  GS     Att   Yds  TD   1D Lng  Y/A    Y/G                 Fmb
0                  1  Jonathan Taylor*                 IND                 22                 RB    17  17     332  1811  18  107  83  5.5  106.5                   4
1                  2      Najee Harris*                PIT                 23                 RB    17  17     307  1200   7   62  37  3.9   70.6                   0
2                  3         Joe Mixon*                CIN                 25                 RB    16  16     292  1205  13   60  32  4.1   75.3                   2
3                  4     Antonio Gibson                WAS                 23                 RB    16  14     258  1037   7   65  27  4.0   64.8                   6
4                  5       Dalvin Cook*                MIN                 26                 RB    13  13     249  1159   6   57  66  4.7   89.2                   3

I'm trying to remove the "Unnamed: 0_level_0..." header row, but nothing I've tried has worked. Thanks in advance!

Answer:

You're close. The table has two header rows, so pandas builds two-level (MultiIndex) columns. Pass the header parameter to pandas.read_html() to use only the second row (index 1) as the column names:

pd.read_html(str(rb_table), header=1)[0]
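
To see what header=1 does without hitting the site, here is a minimal sketch on a made-up table. The HTML below is hypothetical and only mimics the two-row header of the scraped table. It also wraps the markup in io.StringIO, which recent pandas versions ask for instead of a raw HTML string; that part is an addition, not something the original code does.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the scraped table: two header rows, two data rows.
HTML = """
<table id="rushing">
  <thead>
    <tr><th></th><th></th><th colspan="2">Rushing</th></tr>
    <tr><th>Rk</th><th>Player</th><th>Att</th><th>Yds</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Jonathan Taylor*</td><td>332</td><td>1811</td></tr>
    <tr><td>2</td><td>Najee Harris*</td><td>307</td><td>1200</td></tr>
  </tbody>
</table>
"""

# header=1 tells pandas to take the second header row as the column names
# and discard the row above it.
flat = pd.read_html(StringIO(HTML), header=1)[0]

print(list(flat.columns))
print(flat.to_string())
```

Like the original snippet, this assumes an HTML parser such as lxml is installed for pandas to use.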

Example

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table), header=1)[0]
print(rb_df.head().to_string())

Output

   Rk            Player   Tm  Age  Pos   G  GS  Att   Yds  TD   1D  Lng  Y/A    Y/G  Fmb
0   1  Jonathan Taylor*  IND   22   RB  17  17  332  1811  18  107   83  5.5  106.5    4
1   2     Najee Harris*  PIT   23   RB  17  17  307  1200   7   62   37  3.9   70.6    0
2   3        Joe Mixon*  CIN   25   RB  16  16  292  1205  13   60   32  4.1   75.3    2
3   4    Antonio Gibson  WAS   23   RB  16  14  258  1037   7   65   27  4.0   64.8    6
4   5      Dalvin Cook*  MIN   26   RB  13  13  249  1159   6   57   66  4.7   89.2    3
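
If the DataFrame has already been built with the two-level header, another option is to drop the top level of the columns afterwards instead of re-parsing. The sketch below is not part of the original answer; it builds a small frame by hand (the column tuples and rows are copied from the output in the question, trimmed to four columns) so that it runs without a network request.

```python
import pandas as pd

# Hand-built stand-in for the frame that read_html returns without header=1:
# the top level holds the "Unnamed: ..." / group labels, the second level the
# real column names.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "Rk"),
        ("Unnamed: 1_level_0", "Player"),
        ("Rushing", "Att"),
        ("Rushing", "Yds"),
    ]
)
rb_df = pd.DataFrame(
    [
        [1, "Jonathan Taylor*", 332, 1811],
        [2, "Najee Harris*", 307, 1200],
    ],
    columns=columns,
)

# Drop level 0 of the column MultiIndex, keeping only the second header row.
rb_df.columns = rb_df.columns.droplevel(0)

print(rb_df.to_string())
```

Either approach leaves a single header row; header=1 is simpler when you control the read_html call, while droplevel is handy when the frame comes from code you cannot change.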