Beautiful Soup not working with requests.get

Time:05-16

I am a Python beginner trying to scrape http://www.edwaittimes.ca/WaitTimes.aspx, a website that lists wait times for hospitals. For now, I am just trying to print the names of all the hospitals.

My code works if the .html file is saved in the same folder as the Python file I am working with:

from bs4 import BeautifulSoup
import requests


def print_hospitals():
    with open('website.html','r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        hospital_table = soup.find_all('div',class_="Row")
        for hospital in hospital_table:
            if hospital.a is not None:
                print(hospital.a.text)

But when I use requests.get with the URL, the code prints nothing. There are no error messages either.

from bs4 import BeautifulSoup
import requests

def print_hospitals_request():
    html_text = requests.get('http://www.edwaittimes.ca/WaitTimes.aspx').text
    soup = BeautifulSoup(html_text, 'lxml')
    hospital_table = soup.find_all('div',class_="Row")
    for hospital in hospital_table:
        if hospital.a is not None:
            print(hospital.a.text)

Can anyone please help me with this issue?

CodePudding user response:

The page loads its data from external URLs using Ajax, so the HTML that requests.get returns does not contain the rows, and Beautiful Soup finds nothing. To load the data you can use the following example:

import requests
from bs4 import BeautifulSoup


hospitals_csv = "http://www.edwaittimes.ca/Shared/Images/sites2.csv"

data = [
    l.split("|")[:-1]
    for l in requests.get(hospitals_csv).text.splitlines()[:-1]
]

all_data = ""
for hospital, city in data:
    url = (
        "http://www.edwaittimes.ca/Shared/Images/"
        + hospital
        + (".html" if city == "Vancouver" else "_gp.html")
    )
    print(f"Getting {url}")
    # Append each fragment so that all of them are parsed together below.
    all_data += requests.get(url).text

soup = BeautifulSoup(all_data, "html.parser")
for row in soup.select(".Row"):
    print(row.get_text(strip=True, separator=" "))

Prints:

Lions Gate Hospital Patients of all ages seen 02:28 05:06
North Van Urgent & Primary Care Centre Patients of all ages seen UPCC is for mild to moderate illness 01:38 04:15
Squamish General Hospital Patients of all ages seen 01:39 02:16
Whistler Health Care Centre Patients of all ages seen 00:43 01:52
Pemberton Health Centre Patients of all ages seen No patients seen in the last two hours. 02:05
Sechelt Hospital Patients of all ages seen 02:08 04:52
Richmond Hospital Patients of all ages seen 02:36 05:16
Richmond Urgent and Primary Care Centre Patients of all ages seen (lab offsite) UPCC is for mild to moderate illness 03:46 03:28
Vancouver General Hospital Patients of ages 17 and older seen 02:18 05:40
St. Paul's Hospital Patients of all ages seen 00:34 04:26
Mount Saint Joseph Hospital Patients of all ages seen 01:01 02:35
UBC Hospital (UBCH) Patients of all ages seen UBCH is for mild to moderate illness 01:22 01:22
City Centre Urgent & Primary Care Centre Patients of all ages seen UPCC is for mild to moderate illness 01:49 02:30
REACH Urgent and Primary Care Centre Patients of all ages seen (lab & x-ray offsite) UPCC is for mild to moderate illness Currently open, call (604) 216-3138 for wait time
Northeast Urgent and Primary Care Centre Patients of all ages seen (lab & x-ray offsite) UPCC is for mild to moderate illness 02:50 02:50
Southeast Urgent and Primary Care Centre Patients of all ages seen (lab & x-ray offsite) UPCC is for mild to moderate illness 02:12 01:52
BC Children's Hospital Patients seen up to age 16 02:23 04:39
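
The two steps in the code above, reading the site list and building one URL per hospital, can also be written as small functions that are easy to check without touching the network. This is only a sketch: the names parse_sites and fragment_url, the sample text, and the codes VGH and LGH are made up for illustration, the real contents of sites2.csv may differ, and the URL pattern is simply copied from the answer's code. Unlike the original, it skips malformed lines instead of always dropping the last one.

```python
BASE_URL = "http://www.edwaittimes.ca/Shared/Images/"


def parse_sites(csv_text):
    """Return (hospital, city) pairs from pipe-delimited text.

    Assumes the first two fields of each line are the hospital code and
    the city; lines without two non-empty fields are skipped.
    """
    pairs = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            pairs.append((fields[0], fields[1]))
    return pairs


def fragment_url(hospital, city):
    """Build the URL of one hospital's HTML fragment."""
    suffix = ".html" if city == "Vancouver" else "_gp.html"
    return BASE_URL + hospital + suffix


# Made-up sample in the assumed format, not real data from the site.
sample = "VGH|Vancouver|\nLGH|North Vancouver|\n\n"
for hospital, city in parse_sites(sample):
    print(fragment_url(hospital, city))
```

Each URL returned by fragment_url could then be passed to requests.get, as in the loop above.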

CodePudding user response:

The class you are looking for does not seem to exist on the webpage you are scraping. I replaced class_="Row" with class_="grid_8", which is a class that exists on the webpage, and it worked:

from bs4 import BeautifulSoup
import requests


def print_hospitals_request():
    html_text = requests.get('http://www.edwaittimes.ca/WaitTimes.aspx').text
    soup = BeautifulSoup(html_text, 'lxml')
    hospital_table = soup.find_all('div', class_="grid_8")
    for hospital in hospital_table:
        if hospital.a is not None:
            print(hospital.a.text)


print_hospitals_request()

CodePudding user response:

Beautiful Soup and requests are working fine, and what you did works in theory. The catch is that the HTML you see in the browser is the result of the site itself making another request and then populating a table from the response. If you open the browser's developer tools, you'll see a form element with a specific action. My guess is that a GET request returns the initial HTML a user sees, and then the form request and some JavaScript fetch the data from a server.

There's no error because what you get back is exactly the result of the GET request. I'm not sure what sending a POST request to that form would do, and I'm not sure about the terms and conditions of use of that website.
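
To make the "no error, no output" behaviour concrete, here is a small sketch that parses two hand-written snippets. Both snippets and the function name hospital_names are hypothetical: the first stands in for what requests.get might return before any JavaScript runs, the second for the same page after the rows have been filled in. The real markup of the site may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the page as served, before any script has run.
SHELL_HTML = '<html><body><div id="content"></div></body></html>'

# Hypothetical markup: the same page after a script has added the rows.
RENDERED_HTML = """
<html><body><div id="content">
  <div class="Row"><a href="#">Hospital A</a></div>
  <div class="Row"><a href="#">Hospital B</a></div>
</div></body></html>
"""


def hospital_names(html):
    """Return the link text of every div with class "Row" that has a link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        row.a.get_text(strip=True)
        for row in soup.find_all("div", class_="Row")
        if row.a is not None
    ]


print(hospital_names(SHELL_HTML))     # []
print(hospital_names(RENDERED_HTML))  # ['Hospital A', 'Hospital B']
```

find_all returns an empty list when nothing matches, so the loop in the question simply never runs. Printing len(hospital_table) is a quick way to confirm this.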

Assuming that you do have permission to work with that API and this isn't just idle curiosity, you can go one of two routes. One is to try to emulate the request the page makes, using GET instead of POST. The other is to use Selenium (through its Python bindings or some other method) to open a browser, wait until some element is present or a timeout occurs, and then scrape the page with Selenium instead of bs4.

If this is for practice, I used bs4 on Wikipedia. It's an excellent source of open content that includes plenty of tables, and it sends everything as raw HTML.
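
The Selenium route, waiting until an element is present or a timeout occurs, might look roughly like the sketch below. The function name fetch_rendered_html, the div.Row selector, and the 10-second timeout are assumptions chosen for illustration, and the code assumes Chrome plus a matching driver are available on the machine.

```python
def fetch_rendered_html(url, css_selector="div.Row", timeout=10):
    """Return the page source once an element matching css_selector exists.

    Raises selenium.common.exceptions.TimeoutException if nothing matches
    within `timeout` seconds.
    """
    # Imported inside the function so that defining it does not require
    # Selenium to be installed; calling it does.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver can be found
    try:
        driver.get(url)
        # Poll the page until the element appears or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling fetch_rendered_html('http://www.edwaittimes.ca/WaitTimes.aspx') would return the rendered HTML as a string. An explicit wait like this returns as soon as the element shows up, which tends to be more reliable than a fixed sleep.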

CodePudding user response:

The page content is loaded dynamically, so you can use Selenium together with bs4:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('http://www.edwaittimes.ca/WaitTimes.aspx')
time.sleep(2)

soup = BeautifulSoup(driver.page_source,'lxml')
hospital_table = soup.find_all('div',class_="Row")
for hospital in hospital_table:
    if hospital.a is not None:
        print(hospital.a.text)

Output

Mount Saint Joseph Hospital
UBC Hospital (UBCH)
City Centre Urgent & Primary Care Centre
Vancouver General Hospital
St. Paul's Hospital
REACH Urgent and Primary Care Centre
Northeast Urgent and Primary Care Centre
Southeast Urgent and Primary Care Centre
BC Children's Hospital
Richmond Hospital
Richmond Urgent and Primary Care Centre
Squamish General Hospital
Whistler Health Care Centre
Lions Gate Hospital
Sechelt Hospital
Pemberton Health Centre
North Van Urgent & Primary Care Centre