from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
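# Assumed setup (not shown in the original post): fetch and parse the page.
url = 'https://www.pro-football-reference.com/years/2021/passing.htm'
soup = BeautifulSoup(urlopen(url), 'html.parser')
table_body = soup.findAll('tbody', class_ = lambda table_rows: table_rows != "thead")
table_data = [[td.getText() for td in table_body[i].findAll('td')]
              for i in range(len(table_body))]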
I'm working on a project that scrapes data from https://www.pro-football-reference.com/years/2021/passing.htm. My code to scrape the table headers works, but I am having a lot of trouble formatting the table body so that each player's stats end up in their own row. When I run print(table_data), my result is a one-item list that prints the following:
[['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5', 'Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5', 'Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4',....]]
How do I separate this one-item list into multiple lists so that I can achieve my desired output:
[
['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5']
['Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5']
['Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4']
['Patrick Mahomes'...]
['Derek Carr'...]
]
CodePudding user response:
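Your comprehension loops over the <tbody> elements, and the output suggests there is only one, so every cell lands in a single list. Instead, iterate over the rows (<tr>) of the table and, for each row, over its <td> elements to get their text: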
[[e.text for e in row.select('td')] for row in soup.select('tbody tr')]
Output:
[['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5'],
['Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5'],
['Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4'],
['Patrick Mahomes*', 'KAN', '26', 'QB', '17', '17', '12-5-0', '436', '658', '66.3', '4839', '37', '5.6', '13', '2', '260', '75', '7.4', '7.6', '11.1', '284.6', '98.5', '62.2', '28', '146', '4.1', '6.84', '7.07', '3', '3'],
['Derek Carr', 'LVR', '30', 'QB', '17', '17', '10-7-0', '428', '626', '68.4', '4804', '23', '3.7', '14', '2.2', '217', '61', '7.7', '7.4', '11.2', '282.6', '94.0', '52.4', '40', '241', '6', '6.85', '6.60', '3', '6'],
['Joe Burrow', 'CIN', '25', 'QB', '16', '16', '10-6-0', '366', '520', '70.4', '4611', '34', '6.5', '14', '2.7', '202', '82', '8.9', '9.0', '12.6', '288.2', '108.3', '54.3', '51', '370', '8.9', '7.43', '7.51', '2', '3'],
['Dak Prescott', 'DAL', '28', 'QB', '16', '16', '11-5-0', '410', '596', '68.8', '4449', '37', '6.2', '10', '1.7', '227', '51', '7.5', '8.0', '10.9', '278.1', '104.2', '54.6', '30', '144', '4.8', '6.88', '7.34', '1', '2'],
['Josh Allen', 'BUF', '25', 'QB', '17', '17', '11-6-0', '409', '646', '63.3', '4407', '36', '5.6', '15', '2.3', '234', '61', '6.8', '6.9', '10.8', '259.2', '92.2', '60.7', '26', '164', '3.9', '6.31', '6.38', '', ''],
['Kirk Cousins*', 'MIN', '33', 'QB', '16', '16', '8-8-0', '372', '561', '66.3', '4221', '33', '5.9', '7', '1.2', '192', '64', '7.5', '8.1', '11.3', '263.8', '103.1', '52.3', '28', '197', '4.8', '6.83', '7.42', '3', '4'],
['Aaron Rodgers* ', 'GNB', '38', 'QB', '16', '16', '13-3-0', '366', '531', '68.9', '4115', '37', '7', '4', '0.8', '213', '75', '7.7', '8.8', '11.2', '257.2', '111.9', '69.1', '30', '188', '5.3', '7.00', '8.00', '1', '2'],
['Matt Ryan', 'ATL', '36', 'QB', '17', '17', '7-10-0', '375', '560', '67', '3968', '20', '3.6', '12', '2.1', '195', '64', '7.1', '6.8', '10.6', '233.4', '90.4', '46.1', '40', '274', '6.7', '6.16', '5.92', '3', '4'],
['Jimmy Garoppolo', 'SFO', '30', 'QB', '15', '15', '9-6-0', '301', '441', '68.3', '3810', '20', '4.5', '12', '2.7', '172', '83', '8.6', '8.3', '12.7', '254.0', '98.7', '53.3', '29', '201', '6.2', '7.68', '7.38', '3', '3'],...]
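If you already have the flat one-item list and would rather not re-parse the page, a different technique is to slice it into fixed-width chunks. This is only a sketch: it assumes every player row has the same number of <td> cells (30 in the sample output above), and the helper name chunk_rows is made up for illustration. Iterating over the <tr> elements is more robust, since it does not depend on that assumption.

```python
def chunk_rows(flat_cells, cells_per_row):
    """Split a flat list of cell texts into consecutive rows of equal width."""
    if cells_per_row <= 0:
        raise ValueError("cells_per_row must be positive")
    if len(flat_cells) % cells_per_row != 0:
        # A ragged tail means the fixed-width assumption does not hold.
        raise ValueError("cell count is not a multiple of the row width")
    return [flat_cells[i:i + cells_per_row]
            for i in range(0, len(flat_cells), cells_per_row)]


# Shortened stand-in for table_data[0]: three cells per player for brevity.
flat = ['Tom Brady*', 'TAM', '44', 'Justin Herbert*', 'LAC', '23']
rows = chunk_rows(flat, 3)
print(rows)
# [['Tom Brady*', 'TAM', '44'], ['Justin Herbert*', 'LAC', '23']]
```

With the real data the call would be chunk_rows(table_data[0], 30), provided the 30-cell assumption holds for every row.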
Just to point out an alternative: pandas.read_html() is an easy and common way to handle this kind of task. It parses the HTML tables for you, using lxml or BeautifulSoup under the hood.
Example
import pandas as pd
#read the first table from url into dataframe
df = pd.read_html('https://www.pro-football-reference.com/years/2021/passing.htm')[0]
#select only rows that are not subheaders
df[df['Rk'] != 'Rk']
Output
Rk | Player | Tm | Age | Pos | G | GS | QBrec | Cmp | Att | Cmp% | Yds | TD | TD% | Int | Int% | 1D | Lng | Y/A | AY/A | Y/C | Y/G | Rate | QBR | Sk | Yds.1 | Sk% | NY/A | ANY/A | 4QC | GWD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Tom Brady* | TAM | 44 | QB | 17 | 17 | 13-4-0 | 485 | 719 | 67.5 | 5316 | 43 | 6 | 12 | 1.7 | 269 | 62 | 7.4 | 7.8 | 11 | 312.7 | 102.1 | 68.1 | 22 | 144 | 3 | 6.98 | 7.41 | 3 | 5 |
2 | Justin Herbert* | LAC | 23 | QB | 17 | 17 | 9-8-0 | 443 | 672 | 65.9 | 5014 | 38 | 5.7 | 15 | 2.2 | 256 | 72 | 7.5 | 7.6 | 11.3 | 294.9 | 97.7 | 65.6 | 31 | 214 | 4.4 | 6.83 | 6.95 | 5 | 5 |
3 | Matthew Stafford | LAR | 33 | QB | 17 | 17 | 12-5-0 | 404 | 601 | 67.2 | 4886 | 41 | 6.8 | 17 | 2.8 | 233 | 79 | 8.1 | 8.2 | 12.1 | 287.4 | 102.9 | 63.8 | 30 | 243 | 4.8 | 7.36 | 7.45 | 3 | 4 |
4 | Patrick Mahomes* | KAN | 26 | QB | 17 | 17 | 12-5-0 | 436 | 658 | 66.3 | 4839 | 37 | 5.6 | 13 | 2 | 260 | 75 | 7.4 | 7.6 | 11.1 | 284.6 | 98.5 | 62.2 | 28 | 146 | 4.1 | 6.84 | 7.07 | 3 | 3 |
5 | Derek Carr | LVR | 30 | QB | 17 | 17 | 10-7-0 | 428 | 626 | 68.4 | 4804 | 23 | 3.7 | 14 | 2.2 | 217 | 61 | 7.7 | 7.4 | 11.2 | 282.6 | 94 | 52.4 | 40 | 241 | 6 | 6.85 | 6.6 | 3 | 6 |
6 | Joe Burrow | CIN | 25 | QB | 16 | 16 | 10-6-0 | 366 | 520 | 70.4 | 4611 | 34 | 6.5 | 14 | 2.7 | 202 | 82 | 8.9 | 9 | 12.6 | 288.2 | 108.3 | 54.3 | 51 | 370 | 8.9 | 7.43 | 7.51 | 2 | 3 |
7 | Dak Prescott | DAL | 28 | QB | 16 | 16 | 11-5-0 | 410 | 596 | 68.8 | 4449 | 37 | 6.2 | 10 | 1.7 | 227 | 51 | 7.5 | 8 | 10.9 | 278.1 | 104.2 | 54.6 | 30 | 144 | 4.8 | 6.88 | 7.34 | 1 | 2 |
...
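Because the header row repeats inside the table body, the columns are likely to come back as strings rather than numbers. The sketch below shows one way to drop the repeated headers and convert selected columns afterwards. The small DataFrame is a hand-built stand-in for the read_html() result (an assumption for illustration, with values taken from the table above), so the snippet runs without network access.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: all values are strings and
# the header row reappears as a data row partway through.
df = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', '3'],
    'Player': ['Tom Brady*', 'Justin Herbert*', 'Player', 'Matthew Stafford'],
    'Yds': ['5316', '5014', 'Yds', '4886'],
})

# Keep only real player rows, then renumber the index from zero.
players = df[df['Rk'] != 'Rk'].reset_index(drop=True)

# Convert the columns that should be numeric.
for col in ['Rk', 'Yds']:
    players[col] = pd.to_numeric(players[col])

print(players['Yds'].sum())  # 15216
```

Which columns to convert depends on what you need; columns such as QBrec ('13-4-0') are not plain numbers and would need their own handling.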