How to toggle box when web scraping?

I am trying to scrape the site

There is a toggle labeled "When table is sorted, hide non-qualifiers". When it is on, the output should look more like [[Lamar Jackson, BAL, 24, qb, 7, 7, 76, 480, 2, 29, 31, 6.3, 68.6, 5], [Jalen Hurts, PHI, 23, qb, 8, 8, 73, 432, 5, 29, 27, 5.9, 54.0, 5]...]. When it is off, it looks like the output I posted above.

However, when I scrape the site, the toggle defaults to off. Is there a way to turn it on?

CodePudding user response:

You could target the table by id, then exclude any row that contains a td with the class non_qual. I would take the HTML of the remaining rows, wrap it in table tags, and let pandas reconstitute the table. Finally, sort and tidy the table.
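The row filter can be tried offline first. This is only a minimal sketch: the inline HTML and the player names are hypothetical stand-ins for the real page, which has many more columns, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above:
# a table with id "rushing" where non-qualifying rows have a td.non_qual cell.
html = """
<table id="rushing">
  <tr><th>Player</th><th>Y/A</th></tr>
  <tr><td>Player A</td><td>6.3</td></tr>
  <tr><td>Player B</td><td class="non_qual">9.0</td></tr>
  <tr><td>Player C</td><td>5.9</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every row of #rushing that does NOT contain a td with class non_qual.
rows = soup.select("#rushing tr:not(:has(td.non_qual))")

# Collect the cell text of each kept row.
kept = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows
]
print(kept)
```

The header row survives because it has no td.non_qual cell, and only the Player B row is dropped.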

Given that there appear to be some ties within Y/A, it looks like there is a secondary sort on Att, descending. For example, the page shows the following order for Y/A 4.5 (as of 2021-11-07):

18  Aaron Jones 
19  Dalvin Cook 
20  Melvin Gordon   
21  David Montgomery
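To illustrate the tie-break in isolation, here is a small sketch using plain Python. The players and numbers are made up for illustration and are not taken from the page; only the ordering rule (Y/A descending, then Att descending) comes from the observation above.

```python
# Hypothetical rows: (player, yards_per_attempt, attempts).
rows = [
    ("Player A", 4.5, 90),
    ("Player B", 5.1, 60),
    ("Player C", 4.5, 120),
    ("Player D", 4.5, 105),
]

# Sort by Y/A descending, then by Att descending to break ties.
ranked = sorted(rows, key=lambda row: (-row[1], -row[2]))

# Assign ranks starting at 1 in the new order.
for rank, (player, ya, att) in enumerate(ranked, start=1):
    print(rank, player, ya, att)
```

Within the three rows tied at 4.5, the one with the most attempts comes first.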

Code:

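from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get(
    'https://www.pro-football-reference.com/years/2021/rushing.htm')

soup = bs(r.content, 'lxml')

# Keep only rows of the #rushing table that contain no td.non_qual cell,
# then wrap them in <table> tags so pandas can parse them.
rows = soup.select('#rushing tr:not(:has(td.non_qual))')
html = '<table>' + ''.join(str(row) for row in rows) + '</table>'
# StringIO avoids the deprecation warning newer pandas raises for literal HTML.
t = pd.read_html(StringIO(html))[0]

# The parsed header has two levels; keep the lower one.
t.columns = [i[1] for i in t.columns]
# Drop header rows repeated inside the table body and convert numeric columns.
t = t[t.Rk != 'Rk'].apply(pd.to_numeric, errors="ignore")
# Sort by Y/A, breaking ties on Att, both descending.
t.sort_values(['Y/A', 'Att'], ascending=[False, False], inplace=True)
# Renumber the rank column to match the new order.
t.Rk = [i + 1 for i in range(len(t.index))]
t.reset_index(drop=True, inplace=True)

t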

Sample output: