I am trying to scrape data from a website: https://www.oddsportal.com/american-football/usa/nfl/
This link displays the upcoming games.
Currently, I have tried using pandas to read the html data retrieved by Selenium, but the dataframe is a multilevelindex and I'm not sure how to unpack the dataframe into a more readable format.
import undetected_chromedriver.v2 as uc
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
driver = uc.Chrome()
driver.get("https://www.oddsportal.com/american-football/usa/nfl/")
table = driver.find_element(By.XPATH, '//*[@id="tournamentTable"]').get_attribute('outerHTML')
df = pd.read_html(table)[0]
Out:
American football» USA»NFL ...
Today, 17 Oct ... B's
0 13:30 ... 11
1 17:00 ... 11
2 17:00 ... 11
3 17:00 ... 11
4 17:00 ... 11
5 17:00 ... 11
6 17:00 ... 11
7 17:00 ... 11
8 2Q 6' ... 11
9 2Q 4' ... 11
10 2Q 3' ... 11
11 Tomorrow, 18 Oct ... B's
12 NaN ... NaN
13 00:20 ... 11
14 19 Oct 2021 ... B's
15 NaN ... NaN
16 00:15 ... 11
17 15 Nov 2021 ... B's
18 NaN ... NaN
19 00:20 ... 1
[20 rows x 7 columns]
I can unpack the HTML data using this:
all_matches = [i.text for i in driver.find_elements(By.XPATH, '//*[@id="tournamentTable"]/tbody/tr') if "American football" not in i.text]
Out:
["Today, 17 Oct 1 2 B's",
'',
'13:30 Jacksonville Jaguars - Miami Dolphins\n 23:20\n 110\n-128\n11',
'17:00 Baltimore Ravens - Los Angeles Chargers 34:6\n-159\n 138\n11',
'17:00 Carolina Panthers - Minnesota Vikings 28:34 OT\n 114\n-133\n11',
'17:00 Chicago Bears - Green Bay Packers 14:24\n 195\n-233\n11',
'17:00 Detroit Lions - Cincinnati Bengals 11:34\n 165\n-192\n11',
'17:00 Indianapolis Colts - Houston Texans 31:3\n-556\n 428\n11',
'17:00 New York Giants - Los Angeles Rams 11:38\n 286\n-345\n11',
'17:00 Washington Football Team - Kansas City Chiefs 13:31\n 241\n-294\n11',
"2Q 15'\nCleveland Browns - Arizona Cardinals 14:23\n-149\n 127\n11",
"2Q 14'\nDenver Broncos - Las Vegas Raiders 7:10\n-213\n 181\n11",
"2Q 10'\nNew England Patriots - Dallas Cowboys 14:10\n 163\n-189\n11",
"Tomorrow, 18 Oct 1 2 B's",
'',
'00:20 Pittsburgh Steelers - Seattle Seahawks\n-217\n 185\n11',
"19 Oct 2021 1 2 B's",
'',
'00:15 Tennessee Titans - Buffalo Bills\n 206\n-250\n11',
'',
'',
'']
But this would require me to parse the data using dictionaries and would be messy to format correctly.
My expected output is a DF formatted:
date game_time Team1 Team2 Score 1 2
0 2021-10-17 13:30 Jacksonville Jaguars Miami Dolphins 23:20 110 -128
1 2021-10-17 17:00 Baltimore Ravens Los Angeles 34:6 -159 138
2 2021-10-17 17:00 Carolina Panthers Minnesota Vik 28:34 114 -133
3 2021-10-17 17:00 Chicago Bears Green Bay Pack 14:24 195 -233
4 2021-10-17 17:00 Detroit Lions Cincinnati Ben 11:34 165 -192
I'm hoping there is an easier way to pass the data to the pandas read_HTML function that drops the multilevelindex and gets me even close to the format. If I can get close I can format the rest, but I'd like to avoid using dictionaries, but I understand if that is not possible.
CodePudding user response:
read_html()
is good for basic/primitive table but for complex table you have to write own code which will use for
-loops to work separatelly with rows and cells, and if/else
to recognize what kind of data you have in row.
I use only Selenium for this.
First I get all rows in table @id="tournamentTable"
and later I check classes in every row to detect row with date, row with results or hidden row. And next I run different code for different data.
import selenium.webdriver
import pandas as pd
#import undetected_chromedriver.v2 as uc
#driver = uc.Chrome()
#driver = selenium.webdriver.Chrome()
driver = selenium.webdriver.Firefox()
driver.get('https://www.oddsportal.com/american-football/usa/nfl')
# ---
all_results = []
date = None
all_rows = driver.find_elements_by_xpath('//table[@id="tournamentTable"]//tr')
for row in all_rows:
classes = row.get_attribute('class')
print('classes:', classes)
if classes == 'center nob-border':
date = row.find_element_by_tag_name('span').text.strip()
print('date:', date)
elif (classes == 'table-dummyrow') or ('hidden' in classes):
pass # skip empty rows
else:
if date:
all_cells = row.find_elements_by_xpath('.//td')
print('len(all_cells):', len(all_cells))
teams = all_cells[1].text.split(' - ')
if len(all_cells) == 5:
# row without score
row_values = [
date,
all_cells[0].text.strip(),
teams[0].strip(),
teams[1].strip(),
'',
all_cells[2].text.strip(),
all_cells[3].text.strip(),
all_cells[4].text.strip(),
]
else:
# row with score
row_values = [
date,
all_cells[0].text.strip(),
teams[0].strip(),
teams[1].strip(),
all_cells[2].text.strip(),
all_cells[3].text.strip(),
all_cells[4].text.strip(),
all_cells[5].text.strip(),
]
print('row:', row_values)
all_results.append(row_values)
print('-----------------------')
df = pd.DataFrame(all_results, columns=['date', 'game_time', 'Team1', 'Team2', 'Score', '1', '2', 'B'])
print(df)
Result:
date game_time Team1 Team2 Score 1 2 B
0 Today, 18 Oct 4Q 4 Pittsburgh Steelers Seattle Seahawks 17:17 1.44 2.91 11
1 Tomorrow, 19 Oct 00:15 Tennessee Titans Buffalo Bills 3.13 1.39 10
2 22 Oct 2021 00:20 Cleveland Browns Denver Broncos 1.43 2.93 8
3 24 Oct 2021 17:00 Baltimore Ravens Cincinnati Bengals 1.36 3.30 8
4 24 Oct 2021 17:00 Green Bay Packers Washington Football Team 1.22 4.59 8
5 24 Oct 2021 17:00 Miami Dolphins Atlanta Falcons 1.93 1.91 8
6 24 Oct 2021 17:00 New England Patriots New York Jets 1.33 3.43 8
7 24 Oct 2021 17:00 New York Giants Carolina Panthers 2.28 1.67 8
8 24 Oct 2021 17:00 Tennessee Titans Kansas City Chiefs 2.67 1.49 5
9 24 Oct 2021 20:05 Las Vegas Raiders Philadelphia Eagles 1.63 2.35 8
10 24 Oct 2021 20:05 Los Angeles Rams Detroit Lions 1.09 7.94 8
11 24 Oct 2021 20:25 Arizona Cardinals Houston Texans 1.08 8.98 8
12 24 Oct 2021 20:25 Tampa Bay Buccaneers Chicago Bears 1.14 6.16 8
13 25 Oct 2021 00:20 San Francisco 49ers Indianapolis Colts 1.50 2.70 8