Home > Software engineering >  Unpacking pands read_HTML dataframe
Unpacking pands read_HTML dataframe

Time:10-18

I am trying to scrape data from a website: https://www.oddsportal.com/american-football/usa/nfl/

This link displays the upcoming games.

Currently, I have tried using pandas to read the html data retrieved by Selenium, but the dataframe is a multilevelindex and I'm not sure how to unpack the dataframe into a more readable format.

import undetected_chromedriver.v2 as uc
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd



driver = uc.Chrome()
driver.get("https://www.oddsportal.com/american-football/usa/nfl/")

table = driver.find_element(By.XPATH, '//*[@id="tournamentTable"]').get_attribute('outerHTML')

df = pd.read_html(table)[0]

Out:
   American football» USA»NFL  ...     
                Today, 17 Oct  ...  B's
0                       13:30  ...   11
1                       17:00  ...   11
2                       17:00  ...   11
3                       17:00  ...   11
4                       17:00  ...   11
5                       17:00  ...   11
6                       17:00  ...   11
7                       17:00  ...   11
8                       2Q 6'  ...   11
9                       2Q 4'  ...   11
10                      2Q 3'  ...   11
11           Tomorrow, 18 Oct  ...  B's
12                        NaN  ...  NaN
13                      00:20  ...   11
14                19 Oct 2021  ...  B's
15                        NaN  ...  NaN
16                      00:15  ...   11
17                15 Nov 2021  ...  B's
18                        NaN  ...  NaN
19                      00:20  ...    1

[20 rows x 7 columns]

I can unpack the HTML data using this:

all_matches = [i.text for i in driver.find_elements(By.XPATH, '//*[@id="tournamentTable"]/tbody/tr') if "American football" not in i.text]

Out:
["Today, 17 Oct 1 2 B's",
 '',
 '13:30 Jacksonville Jaguars - Miami Dolphins\n  23:20\n 110\n-128\n11',
 '17:00 Baltimore Ravens - Los Angeles Chargers 34:6\n-159\n 138\n11',
 '17:00 Carolina Panthers - Minnesota Vikings 28:34 OT\n 114\n-133\n11',
 '17:00 Chicago Bears - Green Bay Packers 14:24\n 195\n-233\n11',
 '17:00 Detroit Lions - Cincinnati Bengals 11:34\n 165\n-192\n11',
 '17:00 Indianapolis Colts - Houston Texans 31:3\n-556\n 428\n11',
 '17:00 New York Giants - Los Angeles Rams 11:38\n 286\n-345\n11',
 '17:00 Washington Football Team - Kansas City Chiefs 13:31\n 241\n-294\n11',
 "2Q 15'\nCleveland Browns - Arizona Cardinals 14:23\n-149\n 127\n11",
 "2Q 14'\nDenver Broncos - Las Vegas Raiders 7:10\n-213\n 181\n11",
 "2Q 10'\nNew England Patriots - Dallas Cowboys 14:10\n 163\n-189\n11",
 "Tomorrow, 18 Oct 1 2 B's",
 '',
 '00:20 Pittsburgh Steelers - Seattle Seahawks\n-217\n 185\n11',
 "19 Oct 2021 1 2 B's",
 '',
 '00:15 Tennessee Titans - Buffalo Bills\n 206\n-250\n11',
 '',
 '',
 '']

But this would require me to parse the data using dictionaries and would be messy to format correctly.

My expected output is a DF formatted:

    date         game_time          Team1              Team2         Score     1     2   
0   2021-10-17    13:30      Jacksonville Jaguars   Miami Dolphins  23:20     110  -128
1   2021-10-17    17:00      Baltimore Ravens       Los Angeles     34:6     -159   138
2   2021-10-17    17:00      Carolina Panthers      Minnesota Vik   28:34     114  -133
3   2021-10-17    17:00      Chicago Bears          Green Bay Pack  14:24     195  -233
4   2021-10-17    17:00      Detroit Lions          Cincinnati Ben  11:34     165  -192

I'm hoping there is an easier way to pass the data to the pandas read_HTML function that drops the multilevelindex and gets me even close to the format. If I can get close I can format the rest, but I'd like to avoid using dictionaries, but I understand if that is not possible.

CodePudding user response:

read_html() is good for basic/primitive table but for complex table you have to write own code which will use for-loops to work separatelly with rows and cells, and if/else to recognize what kind of data you have in row.

I use only Selenium for this.

First I get all rows in table @id="tournamentTable" and later I check classes in every row to detect row with date, row with results or hidden row. And next I run different code for different data.

import selenium.webdriver
import pandas as pd

#import undetected_chromedriver.v2 as uc
#driver = uc.Chrome()

#driver = selenium.webdriver.Chrome()
driver = selenium.webdriver.Firefox()
driver.get('https://www.oddsportal.com/american-football/usa/nfl')  

# --- 

all_results = []
date = None

all_rows = driver.find_elements_by_xpath('//table[@id="tournamentTable"]//tr')

for row in all_rows:
    classes = row.get_attribute('class')
    print('classes:', classes)
    
    if classes == 'center nob-border':
        date = row.find_element_by_tag_name('span').text.strip()
        print('date:', date)
    elif (classes == 'table-dummyrow') or ('hidden' in classes):
        pass  # skip empty rows
    else:
        if date:
            all_cells = row.find_elements_by_xpath('.//td')
            print('len(all_cells):', len(all_cells))
            teams = all_cells[1].text.split(' - ')
            if len(all_cells) == 5: 
                # row without score
                row_values = [
                    date,
                    all_cells[0].text.strip(),
                    teams[0].strip(),
                    teams[1].strip(),
                    '',
                    all_cells[2].text.strip(),
                    all_cells[3].text.strip(),
                    all_cells[4].text.strip(),
                ]
            else: 
                # row with score
                row_values = [
                    date,
                    all_cells[0].text.strip(),
                    teams[0].strip(),
                    teams[1].strip(),
                    all_cells[2].text.strip(),
                    all_cells[3].text.strip(),
                    all_cells[4].text.strip(),
                    all_cells[5].text.strip(),
                ]

            print('row:', row_values)
            all_results.append(row_values)
            
print('-----------------------')

df = pd.DataFrame(all_results, columns=['date', 'game_time', 'Team1', 'Team2', 'Score', '1', '2', 'B'])

print(df)

Result:

                date game_time                 Team1                     Team2  Score     1     2   B
0      Today, 18 Oct      4Q 4   Pittsburgh Steelers          Seattle Seahawks  17:17  1.44  2.91  11
1   Tomorrow, 19 Oct     00:15      Tennessee Titans             Buffalo Bills         3.13  1.39  10
2        22 Oct 2021     00:20      Cleveland Browns            Denver Broncos         1.43  2.93   8
3        24 Oct 2021     17:00      Baltimore Ravens        Cincinnati Bengals         1.36  3.30   8
4        24 Oct 2021     17:00     Green Bay Packers  Washington Football Team         1.22  4.59   8
5        24 Oct 2021     17:00        Miami Dolphins           Atlanta Falcons         1.93  1.91   8
6        24 Oct 2021     17:00  New England Patriots             New York Jets         1.33  3.43   8
7        24 Oct 2021     17:00       New York Giants         Carolina Panthers         2.28  1.67   8
8        24 Oct 2021     17:00      Tennessee Titans        Kansas City Chiefs         2.67  1.49   5
9        24 Oct 2021     20:05     Las Vegas Raiders       Philadelphia Eagles         1.63  2.35   8
10       24 Oct 2021     20:05      Los Angeles Rams             Detroit Lions         1.09  7.94   8
11       24 Oct 2021     20:25     Arizona Cardinals            Houston Texans         1.08  8.98   8
12       24 Oct 2021     20:25  Tampa Bay Buccaneers             Chicago Bears         1.14  6.16   8
13       25 Oct 2021     00:20   San Francisco 49ers        Indianapolis Colts         1.50  2.70   8

  • Related