Collecting links from a webpage using Selenium: "Load More" problem

I need to collect all the links from the webpage below, which has a "Load More News" button. My script returns only the links from the first page, as if it never clicks the button. I have already updated some of the Selenium calls, but I still can't work out why clicking load_more_button does not give me the remaining links.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
import json


options = webdriver.ChromeOptions()


driver = webdriver.Chrome(options=options)


url = "https://www.mofaic.gov.ae/en/MediaHub/news?categoryId=f9048938-c577-4caa-b1d9-ae1b7a5f1b20"
base_url = "https://www.mofaic.gov.ae"


driver.get(url)

outlinks = []

wait = WebDriverWait(driver, 90)
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))


num_links = 0


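while True:
    # Count the article links currently present in the DOM.
    links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
    num_links_new = len(links)

    if num_links_new > num_links:
        num_links = num_links_new
        # Scroll to the bottom so the "Load More News" button is in view.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
        if load_more_button.is_displayed():
            load_more_button.click()
            # Give the next batch of articles time to load.
            sleep(10)
    else:
        # No new links appeared since the last pass, so stop.
        break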



new_links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
for link in new_links:
    href = link.get_attribute('href')
    full_url = base_url + href
    enurl = full_url.replace("ar-ae", "en")
    outlinks.append(enurl)

print(outlinks)

data = json.dumps(outlinks)


with open('outlinks.json', 'w') as f:
    f.write(data)


driver.close()
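
One detail in the script above may be worth checking: Selenium's get_attribute('href') usually returns an already-resolved absolute URL, in which case prepending base_url would duplicate the host. A minimal sketch of a helper that copes with both relative and absolute hrefs follows. The name to_english_url is hypothetical, and matching the "/ar-ae/" path segment (rather than the bare "ar-ae" string used in the script) is an assumption made to avoid touching the article slug.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.mofaic.gov.ae"


def to_english_url(href, base_url=BASE_URL):
    """Resolve href against base_url, then switch the locale segment to English.

    Hypothetical helper, not part of the original script.
    """
    # urljoin leaves an absolute URL unchanged and prefixes a relative one.
    full_url = urljoin(base_url, href)
    # Assumption: the locale appears as its own path segment, "/ar-ae/".
    return full_url.replace("/ar-ae/", "/en/")
```

Inside the final loop, outlinks.append(to_english_url(href)) would then replace the manual concatenation.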

CodePudding user response:

Although you tagged this question with Selenium, there is a much simpler way to handle it.

Whenever you click on the "load more" button, it sends a POST request to:

https://www.mofaic.gov.ae/api/features/News/NewsListPartialView

So, you can just get all the data from there directly using the requests/BeautifulSoup modules. There's no need for Selenium, and the process will be much faster!

import requests
from bs4 import BeautifulSoup


data = {
    "CurrentPage": "1",
    "CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
    "CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
    "PageSize": "6",
}

BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"


# Each iteration requests one more page, simulating a click on "load more".
# Increase the upper bound to get more articles.
for page in range(1, 10):
    data["CurrentPage"] = page
    response = requests.post(
        POST_URL,
        data=data,
    )
    soup = BeautifulSoup(response.text, "html.parser")

    for link in soup.select("a.text-truncate"):
        print(BASE_URL + link["href"])

Prints (truncated):

https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-leaders
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-vatican
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-fm
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-cuba
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-sudan
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2023-uae-israel
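
If you want every article rather than a fixed number of pages, one option is to keep requesting pages until one comes back with no new links. The sketch below is a rough illustration under stated assumptions, not a tested scraper: it assumes the endpoint returns a fragment without any a.text-truncate links (or only repeats) once the pages run out, which I have not verified. It also swaps BeautifulSoup for the standard-library HTMLParser so the parsing part has no third-party dependency. The names TruncateLinkParser, fetch_page_with_requests, collect_links and the max_pages cap are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"


class TruncateLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'text-truncate'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "text-truncate" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def fetch_page_with_requests(page):
    """Hypothetical fetcher: POST the same form data as above for one page."""
    import requests  # imported lazily so the parsing code runs without it

    form = {
        "CurrentPage": str(page),
        "CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
        "CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
        "PageSize": "6",
    }
    response = requests.post(POST_URL, data=form, timeout=30)
    response.raise_for_status()
    return response.text


def collect_links(fetch_page, max_pages=500):
    """Request pages 1, 2, ... until a page adds no new links.

    Assumption: a page with no new links marks the end of the listing.
    max_pages is a safety cap so the loop always terminates.
    """
    seen = []
    for page in range(1, max_pages + 1):
        parser = TruncateLinkParser()
        parser.feed(fetch_page(page))
        added = 0
        for href in parser.hrefs:
            url = urljoin(BASE_URL, href)
            if url not in seen:
                seen.append(url)
                added += 1
        if added == 0:
            break
    return seen
```

Calling collect_links(fetch_page_with_requests) and passing the result to json.dump would then produce a file like the outlinks.json from the question. Taking the fetcher as a parameter also lets you try the loop against canned HTML before hitting the real site.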