I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without the text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help, what am I missing?
Thank you
CodePudding user response:
BeautifulSoup will not find the table because, from its point of reference, it doesn't exist yet. Here, you tell Selenium to pause its own element matcher whenever it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you take the current state of the HTML (the table still does not exist) and hand it to BeautifulSoup's parser. BS4 will never see the table, even if it loads in later, because it only works on the static snapshot you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try and get the HTML table itself. As Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code execution to proceed. At that point, when BS4 receives the HTML code, the table will be there.
from selenium.webdriver.common.by import By

driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used an XPath here, but you can use any other locator strategy to get the table
driver.find_element(By.XPATH, "/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you have already imported) to get the table data directly. The data is loaded asynchronously from an endpoint you can find yourself with the Chrome DevTools Network tab, and you can pair this with the json module to turn the response into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
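From there you can work with data_dictionary like any other parsed JSON. As a minimal sketch of the same idea, requests can also decode the JSON for you; note that the exact keys in the response are whatever the endpoint returns, so inspect them in DevTools before relying on any of them:

from requests import get

# requests can decode the JSON itself, with basic error handling
response = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current")
response.raise_for_status()        # fail loudly on HTTP errors
data_dictionary = response.json()  # equivalent to json.loads(response.text)

# The structure of the response is not verified here -- print it and adapt
print(data_dictionary)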
CodePudding user response:
The website is taking time to load the data in the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
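Spelled out in full, that approach would look something like this (a sketch; it assumes a Selenium version recent enough to locate chromedriver itself, otherwise pass the executable path as below, and a hard sleep always blocks for the full 10 seconds, so the explicit wait that follows is preferable):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)  # crude: waits the full 10 seconds regardless of load time
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup.find_all('table'))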
Or apply an explicit wait, such that the rows are loaded in the table:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))
# driver.find_element(By.ID, "popin_tc_privacy_button_2").click()  # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)
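If you want the rates as structured data rather than raw tags, one option (a sketch, not part of the original answer, assuming pandas and lxml are installed) is to hand the page source to pandas once the wait above has completed:

import pandas as pd

# read_html parses every <table> in the HTML into a list of DataFrames
dfs = pd.read_html(driver.page_source)
print(dfs[0].head())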
CodePudding user response:
You can use this as the foundation for further work:-
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out, so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding', so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
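If you want the cell text rather than the whole tags, you can follow the loop with something like this (a sketch reusing the same soup and TDCLASS from above, which remain accessible after the with block):

# Collect just the visible text from each matching cell
values = [td.get_text(strip=True) for td in soup.find_all('td', class_=TDCLASS)]
print(values)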