Scraping HREF Links contained within a Table

Time:07-18

I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.

I'm trying to now take the HREF links in the "Result" column from this URL

https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=2&stats_player_seq=-100

The script doesn't seem to be working like it did for other sites.

The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.

Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.

import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100',
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)

print(profiles)

CodePudding user response:

The following code works:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}

r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))

CodePudding user response:

The main issue here is that you are not sending a User-Agent header. Some sites, whether or not it is a good idea, use that header to decide whether a request comes from a bot, and then return no content or only limited content.

So the minimum is to provide that information when making your request:

req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
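
To see why the header matters, it can help to look at what requests sends when you don't set one. A minimal sketch follows; the helper names make_session and fetch_html are my own (not part of requests), and the timeout value is just an assumption:

```python
import requests

# When no User-Agent is supplied, requests identifies itself as
# "python-requests/<version>", which is easy for a site to filter on.
default_ua = requests.utils.default_user_agent()
print(default_ua)


def make_session(user_agent='Mozilla/5.0'):
    """Hypothetical helper: a Session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Hypothetical helper: fetch a page and fail loudly instead of parsing an error page."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


session = make_session()
```

With this, html = fetch_html(session, url) either returns the page text or raises, so a blocked request shows up as an exception rather than as an empty list of links.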

Also take a closer look at your selection. Assuming you want only the team links, you should narrow it; I used CSS selectors:

for profile in soup.select('table a[href^="/team/"]'):
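
If you want to check what that selector matches without hitting the site, you can try it on a small inline snippet. This is only a sketch: SAMPLE_HTML and its href values are made up for illustration and are not copied from the real page, and extract_hrefs is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Made-up markup: one link outside the table, and two kinds of links inside it.
SAMPLE_HTML = """
<a href="/about">About</a>
<table>
  <tr>
    <td><a href="/team/1/100">Opponent A</a></td>
    <td><a href="/contests/1/box_score">W 3-1</a></td>
  </tr>
  <tr>
    <td><a href="/team/2/100">Opponent B</a></td>
    <td><a href="/contests/2/box_score">L 0-2</a></td>
  </tr>
</table>
"""


def extract_hrefs(html, selector):
    """Hypothetical helper: return the href of every element matching a CSS selector."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select(selector)]


# 'table a' limits the match to links inside a table;
# [href^="/team/"] keeps only hrefs that start with /team/.
team_links = extract_hrefs(SAMPLE_HTML, 'table a[href^="/team/"]')
all_links = extract_hrefs(SAMPLE_HTML, 'a[href]')
print(team_links)
```

If the links you are after sit in a different column, inspect their href values in the real page and change the prefix in the selector to match.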

The base URL also needs to be prepended to the extracted values:

profile = 'https://stats.ncaa.org' + profile.get('href')
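
As an alternative to plain string concatenation, urljoin from the standard library also copes with hrefs that are already absolute. A small sketch; the href values here are hypothetical examples, not taken from the page:

```python
from urllib.parse import urljoin

BASE_URL = 'https://stats.ncaa.org'

# Hypothetical href values: one root-relative, one already absolute.
hrefs = [
    '/team/6/15881',
    'https://stats.ncaa.org/team/6/15881',
]

# urljoin resolves each href against the base URL and leaves
# absolute URLs unchanged instead of prefixing them a second time.
absolute = [urljoin(BASE_URL, href) for href in hrefs]
print(absolute)
```

Both entries end up as the same absolute URL, whereas simple concatenation would duplicate the host for the second one.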

Example

from bs4 import BeautifulSoup
import requests

profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']

for url in urls:
    req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)

print(profiles)