Python BeautifulSoup not extracting every URL


I'm trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments

More specifically, I want the links that are hyperlinked under each "Subject Code". However, when I run my code, barely any links get extracted.

I would like to know why this is happening, and how I can fix it.

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"

page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

This is my first attempt at web scraping.

CodePudding user response:

There's anti-bot protection on that site; just add a User-Agent to your request headers. And don't forget to check your soup when things go wrong.

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

The message in the soup was:

Sorry for the inconvenience.

We have detected excess or unusual web requests originating from your browser, and are unable to determine whether these requests are automated.

To proceed to the requested page, please complete the captcha below.
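If you want to catch that situation in code rather than by eyeballing the output, here is a minimal sketch (same URL and headers as above; the 'captcha' keyword check is just an assumption based on the message shown):

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=ua)

# Raise for HTTP-level errors (4xx/5xx) first.
r.raise_for_status()

soup = BeautifulSoup(r.text, features="lxml")

# Heuristic check: if the page text mentions a captcha, the request was probably flagged.
# (Assumption: the block page always contains the word "captcha", as in the message above.)
if 'captcha' in soup.get_text().lower():
    print("Looks like the request was treated as automated -- got a captcha page.")
else:
    print(f"Got a normal page with {len(soup.find_all('a'))} links.")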

CodePudding user response:

I would use the CSS selector td:nth-child(1) to restrict matches to the first column of the table identified by its id, then simply extract the .text of each cell. If that text contains *, supply a default string meaning no course is offered; otherwise, concatenate the retrieved course identifier onto a base query string:

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
no_course = ''  # default value for subjects with no current offering (marked with *)
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
course_info = {i.text: (no_course if '*' in i.text else base + i.text) for i in soup.select('#mainTable td:nth-child(1)')}
print(course_info)
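
For instance, you could then loop over the course_info dict built above and keep only the subjects that actually link to a department page:

# Print only the subjects that map to a department URL (skip the no-course entries).
for subject, link in course_info.items():
    if link:
        print(subject, '->', link)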