Trouble finding links in very large string

I am scraping Baseball Reference for a data science project and have come across an issue when trying to scrape player data from a specific league, one that just started playing this season. When I scrape old leagues that have already finished playing, I have no issues. But I want to scrape the league at this link: https://www.baseball-reference.com/register/league.cgi?id=c346199a live as the season goes on. However, the links are hidden inside what seems to be a lot of plain text, so BeautifulSoup.find_all('a', href = True) does not find them.

So instead, here is what I have tried so far:

html = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/league.cgi?id=c346199a').text, features = 'html.parser').find_all('div')
ind = [str(div) for div in html][0]
orig_ind = ind[ind.find('/register/team.cgi?id='):]
count = orig_ind.count('/register/team.cgi?id=')

team_links = []
for i in range(count):
  # rn finds the same one over and over
  link = orig_ind[orig_ind.find('/register/team.cgi?id='):orig_ind.find('title')].strip().replace('"', '')
  # try to remove it from orig_ind and do the next link...
  # this is the part that is not working rn
  orig_ind = orig_ind.replace(link, '')
  team_links.append('https://baseball-reference.com' + link)

Which outputs:

['https://baseball-reference.com/register/team.cgi?id=71fe19cd',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',

and so on. I am trying to get all of the team links from this page: https://www.baseball-reference.com/register/league.cgi?id=c346199a

and then crawl over to the player links on each of those pages and collect some data. Like I said, it works on pretty much every other league I have tried; this is the only one where it fails.

Any help is greatly appreciated.

Answer:

The tables you see on this site are stored inside HTML comments (<!-- ... -->), so BeautifulSoup normally doesn't see them. To parse them, try the following example:

import requests
from bs4 import BeautifulSoup, Comment


soup = BeautifulSoup(
    requests.get(
        "https://www.baseball-reference.com/register/league.cgi?id=c346199a"
    ).text,
    features="html.parser",
)

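# Collect only Comment nodes, keeping those that hold the hidden tables.
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
s = "".join(c for c in comments if "table_container" in c)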
soup = BeautifulSoup(s, "html.parser")

for a in soup.select('[href*="/register/team.cgi?id="]'):
    print("{:<30} {}".format(a.text, a["href"]))

Prints:

Battle Creek Bombers           /register/team.cgi?id=f3c4b615
Kenosha Kingfish               /register/team.cgi?id=71fe19cd
Kokomo Jackrabbits             /register/team.cgi?id=8f1a41fc
Rockford Rivets                /register/team.cgi?id=9f4fe2ef
Traverse City Pit Spitters     /register/team.cgi?id=7bc8d111
Kalamazoo Growlers             /register/team.cgi?id=9995d2a1
Fond du Lac Dock Spiders       /register/team.cgi?id=02911efc

...and so on.
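
If only the team URLs are needed, a regular expression over the raw response text is another option, since the commented-out markup is still part of the page source. The following is a minimal sketch rather than a definitive solution: the helper name extract_team_links, the sample markup, and the assumption that team ids are lowercase hex strings (as in the ids printed above) are mine, not something the site documents.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.baseball-reference.com"

# Assumption: team ids are lowercase hex strings, as in the ids shown above.
TEAM_HREF = re.compile(r"/register/team\.cgi\?id=[0-9a-f]+")


def extract_team_links(page_text, base_url=BASE_URL):
    """Return absolute team URLs found anywhere in page_text, in first-seen order."""
    hrefs = TEAM_HREF.findall(page_text)
    # dict.fromkeys drops repeated links while preserving order.
    unique_hrefs = dict.fromkeys(hrefs)
    return [urljoin(base_url, href) for href in unique_hrefs]


# Made-up sample that mimics a table hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_standings">
<!--
<div class="table_container">
  <a href="/register/team.cgi?id=f3c4b615" title="Battle Creek Bombers">Battle Creek Bombers</a>
  <a href="/register/team.cgi?id=71fe19cd" title="Kenosha Kingfish">Kenosha Kingfish</a>
  <a href="/register/team.cgi?id=f3c4b615">Battle Creek Bombers</a>
</div>
-->
</div>
"""

for url in extract_team_links(SAMPLE_HTML):
    print(url)
```

For the live page, pass requests.get(...).text to the helper instead of the sample string. This avoids the find/replace loop from the question, but it only returns the hrefs; parsing the comments with BeautifulSoup as above is still the better choice when the team names or other table cells are needed.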