I am trying to scrape the site
There is a toggle that says "When table is sorted, hide non-qualifiers". When that is toggled, the output should look something more like [[Lamar Jackson, BAL, 24, qb, 7, 7, 76, 480, 2, 29, 31, 6.3, 68.6, 5], [Jalen Hurts, PHI, 23, qb, 8, 8, 73, 432, 5, 29, 27, 5.9, 54.0, 5]...]. But when it's not toggled, it looks like the output I posted above.
However, when you scrape the website it defaults to off. Is there a way to toggle this to be on?
CodePudding user response:
You could target the table by id, then exclude rows where there are tds having a class of non_qual. I would use the html from these rows, wrapped with table tags, to reconstitute the table with pandas. Finally, sort and tidy the table.
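To illustrate the row filter on its own, here is a minimal sketch that needs no network access. The HTML snippet is a hypothetical miniature of the page's table (the real markup may differ), and the names kept_rows and kept_names are only for illustration. It relies on the :not() and :has() selectors that BeautifulSoup's select() supports.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; the real page's markup may differ.
html = """
<table id="rushing">
  <tr><th>Player</th><th>Att</th></tr>
  <tr><td>Qualifier A</td><td>100</td></tr>
  <tr><td class="non_qual">Non-qualifier B</td><td class="non_qual">3</td></tr>
  <tr><td>Qualifier C</td><td>80</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep every row of #rushing that has no td with class non_qual.
# The header row has only th cells, so it is kept as well.
kept_rows = soup.select("#rushing tr:not(:has(td.non_qual))")

# Read the first td of each kept data row (the header row has none).
kept_names = [row.find("td").get_text() for row in kept_rows if row.find("td")]
print(kept_names)
```

The non-qualifier row is dropped before pandas ever sees it, which is why no toggle needs to be clicked.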
Given that there appear to be some ties within Y/A, it looks like there is a secondary sort on Att, descending. For example, the page output the following order for Y/A 4.5 (current as of 2021-11-07):
18 Aaron Jones
19 Dalvin Cook
20 Melvin Gordon
21 David Montgomery
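The tie-break can be sketched in plain Python. The rows below are hypothetical placeholders, not real statistics, and the tuple layout (player, Att, Y/A) is an assumption made for illustration. The sort key orders by Y/A descending, then by Att descending, which is the ordering the page appears to use.

```python
# Hypothetical rows as (player, attempts, yards_per_attempt); not real statistics.
rows = [
    ("Player A", 90, 4.5),
    ("Player B", 140, 4.5),
    ("Player C", 60, 5.1),
    ("Player D", 120, 4.5),
]

# Negating both fields sorts each in descending order: Y/A first, Att as tie-break.
ranked = sorted(rows, key=lambda r: (-r[2], -r[1]))

# Assign ranks 1..n after sorting.
ranking = [(rank, name) for rank, (name, _, _) in enumerate(ranked, start=1)]
print(ranking)
```

Among the three players tied at 4.5, the one with the most attempts is ranked first.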
Code:
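from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get(
    'https://www.pro-football-reference.com/years/2021/rushing.htm')
soup = bs(r.content, 'lxml')
# Keep only the rows of #rushing with no td.non_qual cell, then rebuild a table
rows = soup.select('#rushing tr:not(:has(td.non_qual))')
html = '<table>' + ''.join(str(row) for row in rows) + '</table>'
t = pd.read_html(StringIO(html))[0]
# Keep the second element of each column label
t.columns = [i[1] for i in t.columns]
# Drop repeated header rows, then convert columns to numbers where possible
t = t[t.Rk != 'Rk'].apply(pd.to_numeric, errors="ignore")
t.sort_values(['Y/A', 'Att'], ascending=[False, False], inplace=True)
# Renumber the ranks to match the new order
t.Rk = [i + 1 for i in range(len(t.index))]
t.reset_index(drop=True, inplace=True)
t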
Sample output: