I need to collect all links from a webpage as seen below, which also has a load more news button. I wrote my script, but my script gives only the links from the first page, as if it does not click on the load more news button. I updated some of Selenium attributes. I really don't know why I could not get all the links, clicking on load_more button.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
import json
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
url = "https://www.mofaic.gov.ae/en/MediaHub/news?categoryId=f9048938-c577-4caa-b1d9-ae1b7a5f1b20"
base_url = "https://www.mofaic.gov.ae"
driver.get(url)
outlinks = []
wait = WebDriverWait(driver, 90)
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
num_links = 0
while True:
links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
num_links_new = len(links)
if num_links_new > num_links:
num_links = num_links_new
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
if load_more_button.is_displayed():
load_more_button.click()
sleep(10)
else:
break
new_links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
for link in new_links:
href = link.get_attribute('href')
full_url = base_url href
enurl=full_url.replace("ar-ae", "en")
outlinks.append(enurl)
print(outlinks)
data = json.dumps(outlinks)
with open('outlinks.json', 'w') as f:
f.write(data)
driver.close()
CodePudding user response:
Although you have tagged selenium, this is a much better way to handle it.
Whenever you click on the "load more" button, it sends a POST
request to:
https://www.mofaic.gov.ae/api/features/News/NewsListPartialView
So, you can just get all the data from there directly using the requests
/BeautifulSoup
modules. There's no need for Selenium, and the process will be much faster!
import requests
from bs4 import BeautifulSoup
data = {
"CurrentPage": "1",
"CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
"CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
"PageSize": "6",
}
BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"
response = requests.post(
POST_URL,
data=data,
)
for page in range(
1, 10
): # <-- Increase this number to get more Articles - simulates the "load more" button.
data["CurrentPage"] = page
response = requests.post(
POST_URL,
data=data,
)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a.text-truncate"):
print(BASE_URL link["href"])
Prints (truncated):
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-leaders
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-vatican
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-fm
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-cuba
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-sudan
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2023-uae-israel