I'm very new to programming and I've been trying to teach myself some principles of web scraping with baseball data. In the following example, I'm trying to scrape data from CBS Sports: team matchups, game times, and probable pitchers. The team matchups and game times come through without a problem, but the probable pitchers come back as "None".
from bs4 import BeautifulSoup as Soup
import requests
import pandas as pd
from pandas import DataFrame
matchups_response=requests.get('https://www.cbssports.com/mlb/schedule/')
matchups_soup=Soup(matchups_response.text,"lxml")
matchups_tables=matchups_soup.find_all('table')
#len(matchups_tables)
matchups_tables=matchups_tables[0]
rows=matchups_tables.find_all('tr')
first_data_row=rows[1]
first_data_row.find_all(True, {'class':['CellPlayerName--short']})
[str(x.string) for x in first_data_row.find_all(True, {'class':['CellPlayerName--short']})]
def parse_row(row): return [str(x.string) for x in row.find_all(True, {'class':['CellPlayerName--short']})]
list_of_parsed_rows=[parse_row(row) for row in rows[1:31]]
dfPitchers=DataFrame(list_of_parsed_rows)
print(dfPitchers)
And this is what it returns:
0 1
0 None None
1 None None
2 None None
3 None None
4 None None
5 None None
6 None None
7 None None
8 None None
9 None None
10 None None
11 None None
When I use similar code with {'class':['TeamName']} or {'class':['CellGame']}, I get the correct output:
0 1
0 Washington Houston
1 Boston Pittsburgh
2 Minnesota Tampa Bay
3 Philadelphia N.Y. Yankees
4 Milwaukee Cleveland
5 Cincinnati Texas
6 Arizona Chi. Cubs
7 San Diego San Francisco
8 Kansas City Seattle
9 L.A. Angels Colorado
10 N.Y. Mets Miami
11 Oakland L.A. Dodgers
0 WAS 0, HOU 0 - 1st
1 BOS 0, PIT 0 - 1st
2 1:05 pm
3 1:05 pm
4 4:05 pm
5 4:05 pm
6 4:05 pm
7 4:05 pm
8 4:10 pm
9 4:10 pm
10 6:40 pm
11 9:05 pm
But {'class':['CellPlayerName--short']} always returns None. Any help would be appreciated. Apologies in advance, I'm very much a novice, but I've searched and searched for this and can't find a solution I can make work. Thanks!
CodePudding user response:
From the docs: "If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."
Instead of .string, use .text or .get_text() to get your result:
def parse_row(row): return [x.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]
To be more specific, if you only want the value from the <a>, select it directly:
def parse_row(row): return [x.a.text for x in row.find_all(True, {'class':['CellPlayerName--short']})]
Output
0 | 1 |
---|---|
J. Verlander | C. Edwards |
M. Keller | N. Pivetta |
D. Rasmussen | B. Ober |
C. Schmidt | A. Nola |
C. Quantrill | B. Woodruff |
S. Howard | R. Sanmartin |
J. Steele | Z. Davies |
C. Rodon | M. Clevinger |
L. Gilbert | D. Lynch |
A. Senzatela | J. Suarez |
P. Lopez | C. Bassitt |
T. Gonsolin | S. Manaea |
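The difference between .string and .get_text() can be reproduced without hitting the live site. The sketch below uses a made-up HTML fragment (the markup is an assumption, only loosely modelled on a pitcher cell, and is not copied from CBS Sports) and a hypothetical helper, pitcher_name, that falls back to the cell text when no <a> is present, since x.a.text would raise an AttributeError in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first span holds whitespace plus an <a> tag
# (several children), the second holds a single plain string.
html = """
<table>
  <tr>
    <td><span class="CellPlayerName--short">
      <a href="#">J. Verlander</a>
    </span></td>
    <td><span class="CellPlayerName--short">TBD</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all(class_="CellPlayerName--short")

# .string is None whenever a tag has more than one child.
strings = [cell.string for cell in cells]

# .get_text() collects all nested text; strip=True trims whitespace.
texts = [cell.get_text(strip=True) for cell in cells]


def pitcher_name(cell):
    """Prefer the <a> text; fall back to the cell text if there is no link."""
    link = cell.find("a")
    target = link if link is not None else cell
    return target.get_text(strip=True)


names = [pitcher_name(cell) for cell in cells]

print(strings)
print(texts)
print(names)
```

Here the first cell's .string is None while the second cell's is not, which shows that the result depends on how many children the tag has rather than on the class being selected.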