Wait for some time before getting the website source code

I am trying to scrape a website to get the heading and summary of each news item. When the website is first opened, a redirect page appears and it takes about 8 seconds for the actual page to load. As a result, the page source that gets stored is that of the redirect page instead of the main website.

from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/webdrivers/chromedriver"

# Initialize the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Navigate to website
driver.get("https://economictimes.indiatimes.com/markets/stocks/news")
time.sleep(10)
data2, data4 = [], []

while True:
    # Extract data
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = soup.find_all("div", {"class": "example-class"})
    for item in data:
        data2.append(item.find_all('h3'))
        data4.append(item.find_all('p'))

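    try:
        # Find the "Load More" button (find_element_by_css_selector was removed in Selenium 4.3)
        load_more_button = driver.find_element(By.CSS_SELECTOR, "div.autoload_continue")
        # Click the button
        load_more_button.click()
    except Exception:
        # No button found (or it could not be clicked): stop loading more
        break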

# Close the browser
driver.quit()

print(data2)

CodePudding user response：

You could wait until the browser has switched to your final URL:

wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))

Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = 'https://economictimes.indiatimes.com/markets/stocks/news'
wait = WebDriverWait(driver, 10)

driver.get(url)
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
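
To illustrate the idea behind such an explicit wait, here is a minimal sketch of a polling loop in plain Python. It is not Selenium's actual implementation: poll_until, WaitTimeout, url_is and FakeDriver are hypothetical names made up for this example, and FakeDriver only imitates a browser that sits on a redirect page for a few checks (the interstitial URL is a placeholder). The point is that the condition is re-checked until it holds or a timeout expires, rather than sleeping for a fixed time.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true in time (hypothetical name)."""


def poll_until(driver, condition, timeout=10.0, poll_interval=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a browser: reports a redirect URL for the first few checks."""

    def __init__(self, final_url, checks_before_redirect_finishes=3):
        self._final_url = final_url
        self._remaining = checks_before_redirect_finishes

    @property
    def current_url(self):
        if self._remaining > 0:
            self._remaining -= 1
            return "https://example.com/interstitial"  # placeholder redirect page
        return self._final_url


def url_is(expected):
    """Build a condition that is true once the driver is on `expected`."""
    return lambda driver: driver.current_url == expected


final_url = "https://economictimes.indiatimes.com/markets/stocks/news"
driver = FakeDriver(final_url)
reached = poll_until(driver, url_is(final_url), timeout=2.0, poll_interval=0.01)
print(reached)
```

With a real browser you would keep using WebDriverWait and EC.url_to_be as in the example above; the sketch only shows why the wait returns as soon as the redirect has finished and raises an error if it never does.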

CodePudding user response：

An ideal approach would be to wait for the News heading within the webpage to be visible.

Solution

To wait for the News heading to be visible, use WebDriverWait with the visibility_of_element_located() expected condition. You can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1.h1")))

Using XPATH:

driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='h1' and text()='News']")))

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Alternative

You can also wait for the page title to contain "Stocks in News Today", as follows:

driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 10).until(EC.title_contains("Stocks in News Today"))
