Web scraping from a dynamic real-time website

Time:10-20

I've been trying hard to scrape the following data from this page: https://lambda-app-eia.herokuapp.com/

I need to scrape the numbers highlighted in the following image.

I'm trying to build a list of them so that I can treat them as numeric data and run some calculations. I've been told that bs4 does not read dynamic websites, so I switched to Selenium and wrote the following code:

from selenium import webdriver 
from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC



chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=chrome_options)

driver.get("https://lambda-app-eia.herokuapp.com/") 

Then I try to create a list:

elements = driver.find_elements(By.CSS_SELECTOR, ".MuiTypography-root.MuiTypography-h4.css-2voflx")

job_list = []
for job in elements:
    job_list.append(job.get_attribute('href'))
print(job_list)

As a result, I get a list of length 4 containing only None values.

I suspect it has something to do with the initial CSS_SELECTOR lookup, since I took the class name from the page's source code, or perhaps with the href, which somehow "filters" the numeric data, but I'm rather lost at this point. I have never worked with these libraries, so my errors might be pretty fundamental. Of course, any help is greatly appreciated.

CodePudding user response:

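You have to read the text of each element, not its href: get_attribute('href') returns None when an element has no href attribute, which is why you get a list of None values. Also add some wait time so the page can render its dynamic content before you query it: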

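import time

time.sleep(2)  # give the page time to render its dynamic values
elements = driver.find_elements(By.CSS_SELECTOR, ".MuiTypography-root.MuiTypography-h4.css-2voflx")

job_list = []
for job in elements:
    job_list.append(job.text)  # visible text of the element, not an attribute
print(job_list)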
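
A fixed sleep can be too short or too long. As a rough sketch, the same idea can be written with WebDriverWait, which the question already imports, followed by a conversion of the scraped strings to numbers so they can be used in calculations. This assumes Selenium 4 and a Chrome install that Selenium can find. The helper names scrape_texts and to_numbers are made up for illustration, and the number format (dot as decimal separator, comma as thousands separator) is a guess about what the page shows:

```python
import re


def to_numbers(texts):
    """Convert scraped strings to floats, skipping strings without a number.

    Assumption: '.' is the decimal separator and ',' a thousands separator.
    """
    numbers = []
    for text in texts:
        match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if match:
            numbers.append(float(match.group()))
    return numbers


def scrape_texts(url, selector, timeout=10):
    """Open url headlessly and return the text of every element matching selector.

    Hypothetical helper: polls until all matching elements have non-empty text,
    or raises TimeoutException after `timeout` seconds.
    """
    # Imported here so to_numbers() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        def non_empty_texts(d):
            # A falsy return value makes WebDriverWait keep polling.
            texts = [e.text.strip() for e in d.find_elements(By.CSS_SELECTOR, selector)]
            return texts if texts and all(texts) else False

        return WebDriverWait(driver, timeout).until(non_empty_texts)
    finally:
        driver.quit()  # always release the browser, even on timeout
```

With these helpers, something like to_numbers(scrape_texts("https://lambda-app-eia.herokuapp.com/", ".MuiTypography-root.MuiTypography-h4")) should give a list of floats. The generated css-2voflx class is left out of the selector here on the assumption that it may change between builds of the site; add it back if the shorter selector matches too much.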