Scraping a Dynamic Website using Selenium or Beautiful Soup

Time:11-19

I am trying to scrape this dynamic website to get the names and lecture times of the courses offered during a semester: https://www.utsc.utoronto.ca/registrar/timetable

The problem is that when you first open the page, no courses are displayed. They only appear after you select the "Session(s)" and click "Search for Courses".

Here is where the problems start:

  1. I cannot do
html = urlopen(url).read()

using urllib.request, as it only returns the HTML of the initial page, before any courses are shown.

  2. I did a quick search on how to scrape a dynamic website and came across code like this:
import requests
url = 'https://www.utsc.utoronto.ca/registrar/timetable'

r = requests.get(url)
data = r.json()
print(data)

However, when I run this it raises "JSONDecodeError: Expecting value", and I have no idea why, since the same approach has worked on other dynamic websites.

I do not really have to use Selenium or Beautiful Soup, so if there are better alternatives I will gladly try them. Also, I was wondering, when I run:

html = urlopen(url).read()

what is the format of the HTML that is returned? I want to know whether I can just copy the updated HTML from the browser's inspector after selecting the Session(s) and clicking search.

PS: this is my first time asking on Stack Overflow, so please let me know if my question is not clear enough. Sorry, and thanks in advance!

CodePudding user response:

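from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

PATH = "chromedriver"  # placeholder: path to your chromedriver executable

def render_page(url):
    # Render the page in Chrome and return all the HTML of that page.
    # Selenium 4 takes the driver path through a Service object.
    driver = webdriver.Chrome(service=Service(PATH))
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # close the browser even if loading fails

def create_soup(html_text):
    # Parse the rendered HTML; the 'lxml' parser requires the lxml package.
    return BeautifulSoup(html_text, 'lxml')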

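You will need to use Selenium for this if the content is loaded dynamically. Create a Beautiful Soup object from the value returned by render_page() and see if you can extract the data from it. Since the courses only appear after a search, you will also have to select a session and click "Search for Courses" with Selenium before reading page_source.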

CodePudding user response:

You can use this code to get the data you need:

import requests

url = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"

# for winter session
payload = "coursecode=&sessions[]=20219&instructor=&courseTitle="

headers = {
  'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
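
If you would rather not hand-write the payload string, here is a rough sketch of the same POST using only the standard library (urllib instead of requests). The helper names build_payload, parse_body and fetch_timetable are made up for illustration, and the endpoint and field names are simply the ones from the snippet above. I am not sure what format the endpoint returns, so parse_body tries JSON first and falls back to the raw text; calling .json() on a body that is not JSON (such as the HTML of the timetable page itself) is what produces "JSONDecodeError: Expecting value".

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the snippet above; assumed to still accept this form.
API_URL = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"


def build_payload(session_codes):
    """Build the form-encoded request body for one or more session codes."""
    fields = [("coursecode", "")]
    fields += [("sessions[]", code) for code in session_codes]
    fields += [("instructor", ""), ("courseTitle", "")]
    # urlencode percent-encodes the brackets ("sessions%5B%5D"), which is the
    # standard encoding of the same field name.
    return urlencode(fields)


def parse_body(text):
    """Return decoded JSON if the body is valid JSON, otherwise the raw text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def fetch_timetable(session_codes, timeout=30):
    """POST the search form and return the parsed (or raw) response body."""
    request = Request(
        API_URL,
        data=build_payload(session_codes).encode("utf-8"),  # data => POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_body(response.read().decode("utf-8", errors="replace"))
```

Calling fetch_timetable(["20219"]) should send the same fields as the payload above. Passing several codes repeats the sessions[] field once per code, which is how a form normally submits a multi-select.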