I cannot scrape a table from a website with usual web scraping tools

Time:11-07

I am trying to scrape a table from a website with Python, but every method I know has failed. The table is at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ and spans 45 pages. I have tried requests, requests-html (with rendering), BeautifulSoup, and Selenium. Here is one of my attempts; I won't paste all of them, since they follow the same approach with different Python libraries:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
page.html.render(timeout=120)
soup = BeautifulSoup(page.content, 'lxml') #also tried with page.text and 'html.parser' and all permutations
table = soup.find_all(id='table')

My table variable ends up as an empty list, which it shouldn't be. With Selenium I also tried to locate other elements inside the table by class and by XPath, but none of these attempts found the table or any part of it. I have scraped quite a few similar websites with these methods and never had a problem before this one. Any ideas, please?

CodePudding user response:

If you inspect the page, you'll see that the table is inside an iframe. You can extract the information directly from the iframe's source:

https://flo.uri.sh/visualisation/3894531/embed?auto=1
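
One way to discover such an iframe URL without opening the browser's developer tools is to list every iframe `src` in the page's HTML. The following is a minimal sketch using only the standard library. It assumes the iframe tag is present in the HTML with a plain `src` attribute (lazy-loaded or script-injected iframes would not show up this way), and `sample_html` is a hypothetical, simplified stand-in for the article's real markup.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Collects the src attribute of every <iframe> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase and attrs as (name, value) pairs
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html):
    """Return the src of each iframe in the given HTML string, in document order."""
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical, simplified stand-in for the article's markup.
sample_html = """
<div class="article">
  <p>School case counts:</p>
  <div><iframe src="https://flo.uri.sh/visualisation/3894531/embed?auto=1"></iframe></div>
</div>
"""

iframe_sources = find_iframe_sources(sample_html)
print(iframe_sources)
```

With a real page, the HTML string would come from the response body of the article URL instead of `sample_html`.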

Here is the code, which should save the result to a .csv file:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def get_rows(driver):
    """
    Return the rows of the table page that is currently displayed.

    Returns:
        dict: maps each column name to a list of cell texts, one per row
    """
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']")))
    rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
    table_info= {
        'Rank': [],
        'County':[],
        'School/District':[],
        'Type':[],
        'Total cases':[],
        'Student cases':[],
        'Staff cases':[]
    }
    
    for row in rows:
        cols = row.find_elements(By.CLASS_NAME, 'td')
        # dicts keep insertion order, so the i-th key lines up with the i-th cell
        for index, column in enumerate(table_info):
            table_info[column].append(cols[index].text)

    return table_info

# path to chrome driver (raw string, so the backslashes are taken literally)
driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe")

driver.get("https://flo.uri.sh/visualisation/3894531/embed?auto=1")


df = pd.DataFrame.from_dict(get_rows(driver))

for _ in range(44):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//button[@]'))).click()
    df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])

print(df)
df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
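
The loop above hard-codes 44 clicks for the 45 pages, so it breaks if the page count changes. A count-free alternative is to read the current page, then advance only while a "next" control is still available. The sketch below shows just that control flow, with the browser abstracted away so it can run without Selenium; `collect_pages`, `FakePager`, and the three callables are hypothetical names introduced for illustration.

```python
def collect_pages(read_rows, has_next, go_next, max_pages=1000):
    """Read the current page, then advance until has_next() returns False."""
    collected = []
    for _ in range(max_pages):  # upper bound guards against an endless loop
        collected.extend(read_rows())
        if not has_next():
            break
        go_next()
    return collected


class FakePager:
    """Hypothetical in-memory stand-in for a paginated table in a browser."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0
        self.clicks = 0

    def read_rows(self):
        return self.pages[self.index]

    def has_next(self):
        return self.index < len(self.pages) - 1

    def go_next(self):
        self.index += 1
        self.clicks += 1


pager = FakePager([["row 1", "row 2"], ["row 3", "row 4"], ["row 5"]])
rows = collect_pages(pager.read_rows, pager.has_next, pager.go_next)
print(rows, pager.clicks)
```

In a Selenium script, the three callables would wrap whatever reads the visible rows, checks the state of the next button, and clicks it.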

CodePudding user response:

The issue is that the content is inside an iframe, so you need to switch the driver into that iframe before looking for the table. See the Selenium API docs.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/'
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
try:
    driver.get(url)
    driver.implicitly_wait(5)
    driver.switch_to.frame(driver.find_element(By.XPATH,
            '//div[@]/div/iframe'))
    # table content is now in the driver context
    while True:
        table = driver.find_element(By.ID, "table")
        for elt in table.find_elements(By.CLASS_NAME, "body-row"):
            x = [td.text for td in elt.find_elements(By.CLASS_NAME, "td")]
            # add code to append each row of data to a CSV file, database, etc.
            print(x)
        next_btn = driver.find_element(By.CLASS_NAME, 'next')        
        if 'disabled' in next_btn.get_attribute('class'):
            # no more pages: done with pagination
            break
        next_btn.click() # click next button for next set of items
finally:
    driver.quit()

Outputs:

['1', 'Delaware', 'Olentangy Local', 'Public District', '38', '31', '7']
...
['446', 'Muskingum', 'West Muskingum Local', 'Public District', '1', '1', '0']
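
Each printed row is a plain list of strings, so it can go straight into the standard `csv` module in place of the `print(x)` call. The following is a minimal sketch of that step. The column names are taken from the first answer, `write_rows` is a hypothetical helper, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Column names as listed in the first answer.
COLUMNS = ["Rank", "County", "School/District", "Type",
           "Total cases", "Student cases", "Staff cases"]


def write_rows(rows, out):
    """Write a header line, then one CSV line per complete row; return the row count."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    written = 0
    for row in rows:
        if len(row) != len(COLUMNS):
            continue  # skip rows that do not have exactly one value per column
        writer.writerow(row)
        written += 1
    return written


# Two rows copied from the output above; a real run would pass every scraped row.
sample_rows = [
    ["1", "Delaware", "Olentangy Local", "Public District", "38", "31", "7"],
    ["446", "Muskingum", "West Muskingum Local", "Public District", "1", "1", "0"],
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
count = write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to disk, opening the file with `newline=""` lets the `csv` module control line endings itself.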