Finding href in "https://www.baseball-reference.com/" webpage using a Python webscraper

Time:02-02

I would like to scrape all of the "boxscore" hyperlinks from the webpage requested in "requests.get" below and write them to an Excel spreadsheet. However, the program below outputs every element with the class "game", including all of its text, rather than just the links. What needs to be changed so that it outputs only the boxscore href found within the "em" elements under the class "game"?

import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook("tennis_input3.xlsx")
ws = wb.active

response = requests.get('https://www.baseball-reference.com/leagues/majors/2010-schedule.shtml')
webpage = response.content
soup = BeautifulSoup(response.text, "html.parser")
  
col1 = soup.find_all("p", class_="game")

print(pd.DataFrame({"MatchLink":col1}))
df = pd.DataFrame({"MatchLink":col1})

df.to_excel("tennis_3.xlsx", sheet_name="welcome")

CodePudding user response:

Select your elements more specifically, following the structure you described yourself:

soup.select('p.game em a')

or

soup.select('p.game a[href*=boxes]')
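
To see how the two selectors narrow the result, here is a minimal sketch run against an inline HTML snippet instead of the live site. The markup is a hypothetical reconstruction based on the structure described in the question (team links plus a boxscore link wrapped in "em"); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's description of the page;
# it is an assumption, not a copy of the real HTML.
html = """
<p class="game">
  <a href="/teams/NYY/2010.shtml">New York Yankees</a> (7) @
  <a href="/teams/BOS/2010.shtml">Boston Red Sox</a> (9)
  <em><a href="/boxes/BOS/BOS201004040.shtml">Boxscore</a></em>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every link inside a p.game element, team links included.
all_links = [a["href"] for a in soup.select("p.game a")]

# Only links nested inside an <em> within p.game.
by_structure = [a["href"] for a in soup.select("p.game em a")]

# Only links whose href contains the substring "boxes".
by_href = [a["href"] for a in soup.select("p.game a[href*=boxes]")]

print(len(all_links), by_structure, by_href)
```

The first selector depends on the boxscore link staying inside "em"; the second depends on the URL path containing "boxes". Either one should work as long as its assumption about the page holds.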

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get('https://www.baseball-reference.com/leagues/majors/2010-schedule.shtml')
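# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(response.text, "html.parser")

# The hrefs are site-relative, so prefix each one with the domain
pd.DataFrame(
    ['https://www.baseball-reference.com' + e.get('href') for e in soup.select('p.game em a')],
    columns = ['url']
)#.to_excel(...)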

Output

url
0 https://www.baseball-reference.com/boxes/BOS/BOS201004040.shtml
1 https://www.baseball-reference.com/boxes/ANA/ANA201004050.shtml
2 https://www.baseball-reference.com/boxes/ARI/ARI201004050.shtml
3 https://www.baseball-reference.com/boxes/ATL/ATL201004050.shtml
4 https://www.baseball-reference.com/boxes/CHA/CHA201004050.shtml
...
2457 https://www.baseball-reference.com/boxes/SFN/SFN201010270.shtml
2458 https://www.baseball-reference.com/boxes/SFN/SFN201010280.shtml
2459 https://www.baseball-reference.com/boxes/TEX/TEX201010300.shtml
2460 https://www.baseball-reference.com/boxes/TEX/TEX201010310.shtml
2461 https://www.baseball-reference.com/boxes/TEX/TEX201011010.shtml
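
Prefixing the domain by hand assumes every href starts with "/". As an alternative sketch, the standard library's urllib.parse.urljoin resolves relative links against a base URL and leaves absolute links untouched. The sample hrefs below are illustrative: the first is taken from the output above, the second is a made-up absolute link.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.baseball-reference.com"

# Illustrative hrefs: one site-relative (as in the output above),
# one hypothetical link that is already absolute.
hrefs = [
    "/boxes/BOS/BOS201004040.shtml",
    "https://www.baseball-reference.com/boxes/ANA/ANA201004050.shtml",
]

# urljoin prepends the base to relative paths and keeps absolute URLs as-is.
urls = [urljoin(BASE_URL, href) for href in hrefs]

for url in urls:
    print(url)
```

The resulting list can be passed to pd.DataFrame(urls, columns=['url']) in the same way as in the example.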