Python web scraping for a specific class returns none

Time:03-30

I'm very new to programming and I've been trying to teach myself some principles of web scraping with baseball data. In the following example, I'm trying to scrape data from CBS Sports on baseball team matchups, game times, and probable pitchers. I've had no problem getting the team matchups and game times to show up, but the probable pitchers come back as "None".

from bs4 import BeautifulSoup as Soup
import requests
import pandas as pd
from pandas import DataFrame

matchups_response=requests.get('https://www.cbssports.com/mlb/schedule/')

# "lxml" is a parser name for BeautifulSoup, not an argument to requests.get
matchups_soup=Soup(matchups_response.text,"lxml")

matchups_tables=matchups_soup.find_all('table')

#len(matchups_tables)

matchups_tables=matchups_tables[0]

rows=matchups_tables.find_all('tr')

first_data_row=rows[1]

first_data_row.find_all(True, {'class':['CellPlayerName--short']})

[str(x.string) for x in first_data_row.find_all(True, {'class':['CellPlayerName--short']})]

def parse_row(row): return [str(x.string) for x in row.find_all(True, {'class':['CellPlayerName--short']})]

list_of_parsed_rows=[parse_row(row) for row in rows[1:31]]

dfPitchers=DataFrame(list_of_parsed_rows)

print(dfPitchers)

And this is what it returns:

       0     1
0   None  None
1   None  None
2   None  None
3   None  None
4   None  None
5   None  None
6   None  None
7   None  None
8   None  None
9   None  None
10  None  None
11  None  None

When I use similar code and refer to {'class':['TeamName']} or {'class':['CellGame']}, I get correct output:

               0              1
0     Washington        Houston
1         Boston     Pittsburgh
2      Minnesota      Tampa Bay
3   Philadelphia   N.Y. Yankees
4      Milwaukee      Cleveland
5     Cincinnati          Texas
6        Arizona      Chi. Cubs
7      San Diego  San Francisco
8    Kansas City        Seattle
9    L.A. Angels       Colorado
10     N.Y. Mets          Miami
11       Oakland   L.A. Dodgers

0   WAS 0, HOU 0 - 1st
1   BOS 0, PIT 0 - 1st
2              1:05 pm
3              1:05 pm
4              4:05 pm
5              4:05 pm
6              4:05 pm
7              4:05 pm
8              4:10 pm
9              4:10 pm
10             6:40 pm
11             9:05 pm

But for {'class':['CellPlayerName--short']} it always returns None. Any help would be appreciated. Apologies in advance, I'm very much a novice, but I've searched and searched for this and can't find a solution I can make work. Thanks!

Answer:

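From the docs: "If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None." That is presumably what happens with the CellPlayerName--short elements here: x.string is None, and str(x.string) turns it into the text "None".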
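A small illustration of that rule, as a sketch only. The HTML below is made up for the example and is an assumption, not the actual CBS Sports markup; it just shows one element with a single child chain and one with two children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (not the real page): the first span holds an <a> tag
# plus trailing text (two children), the second holds only an <a> tag.
html = (
    '<td>'
    '<span class="CellPlayerName--short"><a href="#">J. Verlander</a> (0-0)</span>'
    '<span class="TeamName"><a href="#">Houston</a></span>'
    '</td>'
)
soup = BeautifulSoup(html, "html.parser")

pitcher = soup.find(class_="CellPlayerName--short")
team = soup.find(class_="TeamName")

# Two children (<a> and a text node), so .string is ambiguous and is None.
string_of_pitcher = pitcher.string

# One child, which itself has one string, so .string resolves to that string.
string_of_team = team.string

# get_text() concatenates all text inside the tag, however many children it has.
text_of_pitcher = pitcher.get_text()

print(repr(string_of_pitcher), repr(string_of_team), repr(text_of_pitcher))
```

Wrapping the first value in str() gives the text "None", which matches what shows up in the DataFrame in the question.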

Instead of .string use .text / .get_text() to get your result:

def parse_row(row): return [x.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]

To be more specific, if you only want the value from the <a>:

def parse_row(row): return [x.a.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]

Output

0             1
J. Verlander  C. Edwards
M. Keller     N. Pivetta
D. Rasmussen  B. Ober
C. Schmidt    A. Nola
C. Quantrill  B. Woodruff
S. Howard     R. Sanmartin
J. Steele     Z. Davies
C. Rodon      M. Clevinger
L. Gilbert    D. Lynch
A. Senzatela  J. Suarez
P. Lopez      C. Bassitt
T. Gonsolin   S. Manaea
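One caveat with x.a.text: if a matching element has no <a> inside it, x.a is None and the attribute access raises AttributeError. Whether that ever happens on the real page is not something the thread confirms, so treat the following as a defensive sketch. The sample row and the "TBD" cell are invented for the example.

```python
from bs4 import BeautifulSoup

def parse_row(row):
    """Return pitcher names in a row, falling back to the element's own text
    when there is no <a> inside it."""
    names = []
    for cell in row.find_all(class_="CellPlayerName--short"):
        link = cell.find("a")
        source = link if link is not None else cell
        names.append(source.get_text(strip=True))
    return names

# Hypothetical row (assumed markup, not the real page): one pitcher with a
# link and extra text, one plain-text placeholder without a link.
html = """<table><tr>
<td><span class="CellPlayerName--short"><a href="#">J. Verlander</a> (0-0)</span></td>
<td><span class="CellPlayerName--short">TBD</span></td>
</tr></table>"""

row = BeautifulSoup(html, "html.parser").find("tr")
names = parse_row(row)
print(names)
```

It can be dropped in as a replacement for parse_row in the list comprehension from the question.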