Cannot find the table data within the soup, but I know it's there


I am trying to create a function that scrapes college baseball team roster pages for a project. I have written a function that crawls the roster page and collects a list of the links I want to scrape. But when I scrape the individual link for each player, the request works, yet it cannot find the data that is on their page.

This is the link to the page I am crawling from at the start:

https://gvsulakers.com/sports/baseball/roster

These are helper functions called inside the function I am having a problem with:

import requests
from bs4 import BeautifulSoup

# headers is defined earlier in my script; something like:
headers = {'User-Agent': 'Mozilla/5.0'}

def parse_row(rows):
    # grab the text of every <td> cell in a row
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return lopr

Here is what I am having an issue with: when I assign the result of type1_roster() to a variable and print it, I only get an empty list. It should contain data about a player or players from a player's roster page.

# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
        return find_data(player_)
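
For example, calling it with the team id from the link above reproduces the problem:

result = type1_roster('gvsulakers')
print(result)   # prints an empty list instead of player data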

CodePudding user response:

A number of things:

  1. I would pass the headers as a global.
  2. You are slicing the link one character too late for player_, I think (see the worked example after this list).
  3. You need to rework the logic of find_data(), as the data is present in a mixture of element types, not in table/tr/td elements; it is found in spans, for example. The HTML attributes are nice and descriptive and will support targeting content easily.
  4. You can target the player links from the landing page more tightly with the CSS selector list shown below. This removes the need for multiple loops as well as the use of list(set()).
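
To illustrate point 2: the base URL already ends in /sports/baseball/roster, which is 23 characters long, so link[23:] keeps the leading slash of the player path while link[24:] drops it and corrupts the URL (the player path below is hypothetical):

link = '/sports/baseball/roster/players/john-smith'   # hypothetical href from the roster page
base = 'https://gvsulakers.com/sports/baseball/roster'
print(base + link[23:])  # .../roster/players/john-smith -> valid
print(base + link[24:])  # .../rosterplayers/john-smith  -> slash lost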

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}


def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop


def find_data(url):
    page = requests.get(url, headers=HEADERS)
    #print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)


def type1_roster(team_id):
    url = "https://"   team_id   ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
        return find_data(player_)  # note: returns the first player's data only


print(type1_roster('gvsulakers'))
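
As written, type1_roster() returns inside the loop, so you only get the first player back. If you want every player, a minimal variation (using the same helpers and selector as above) is to collect the results in a list:

def type1_roster_all(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # one (first_name, bio) tuple per player instead of returning early
    return [find_data(url + link[23:]) for link in player_links]

print(type1_roster_all('gvsulakers'))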