Home > front end >  Collecting links from a JS-Based Webpage using Selenium
Collecting links from a JS-Based Webpage using Selenium

Time:01-19

I need to collect all links from a webpage as seen below (25 links from each 206 pages, around 5200 total links), which also has a load more news button (as three dots). I wrote my script, but my script does not give any links that I tried to collect. I updated some of Selenium attributes. I really don't know why I could not get all the links.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from selenium.webdriver.common.by import By


from selenium.webdriver import Chrome


#Initialize the Chrome driver
driver = webdriver.Chrome()


driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")


page_count = driver.find_element(By.XPATH, "//span[@class='rgInfoPart']")
text = page_count.text
page_count = int(text.split()[-1])


links = []


for i in range(1, page_count   1):
    # Click on the page number
    driver.find_element(By.XPATH, f"//a[text()='{i}']").click()
    time.sleep(5)
    # Wait for the page to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Extract the links from the page
    page_links = soup.find_all('div', {'class': 'sub_lstitm'})
    for link in page_links:
        links.append("https://www.mfa.gov.tr" link.find('a')['href'])
    time.sleep(5)

driver.quit()

print(links)

I tried to run my code but actually I couldn't. I need to have some solution for this.

CodePudding user response:

You can easily do everything in Selenium using the following method:

  1. Wait for the links to be visible on the page
  2. Get titles and urls
  3. Get the current page number
  4. If there is the button for the next page, then click it and repeat from step 1., otherwise it means we are in the last page hence the execution ends

Add these imports

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

and then run the following code

driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")

titles, urls = [], []

while 1:
    links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm a")))
    for link in links:
        titles.append( link.text )
        urls.append( link.get_attribute('href') )
    
    current_page = int(driver.find_element(By.CSS_SELECTOR, 'td span').text)
    print('current page:', current_page, end='\r')
    try:
        next_page_button = driver.find_element(By.XPATH, f'//a[text()={current_page 1}]')
    except:
        print('current page is the last one')
        break
    next_page_button.click()

for i in range(len(titles)):
    print(titles[i],'\n',urls[i],'\n')

Output

No: 17, 18 January 2023, Press Release Regarding the Consular Consultations Between Türkiye and Yemen 
 https://www.mfa.gov.tr/no_-17_-turkiye-ve-yemen-arasinda-gerceklestirilecek-konsolosluk-istisareleri-hk.en.mfa 

No: 16, 17 January 2023, Press Release Regarding the Visit of H.E. Mr. Mevlüt Çavuşoğlu, Minister of Foreign Affairs of the Republic of Türkiye, to the U.S. 
 https://www.mfa.gov.tr/no_-16_-sayin-bakanimizin-abd-ziyareti-hk.en.mfa 

No: 15, 16 January 2023, Press Release Regarding the Visit of H.E. Mr. Hossein Amir Abdollahian, Minister of Foreign Affairs of the Islamic Republic of Iran, to Türkiye 
 https://www.mfa.gov.tr/no_-15_-iran-islam-cumhuriyeti-disisleri-bakani-huseyin-emir-abdullahiyan-in-ulkemize-yapacagi-ziyaret-hk.en.mfa 

No: 14, 15 January 2023, Press Release Regarding the Plane Crash in Nepal 
 https://www.mfa.gov.tr/no_-14_-nepal-de-meydana-gelen-ucak-kazasi-hk.en.mfa 

No: 13, 14 January 2023, Press Release Regarding the Visit of H.E. Ms. Bisera Turkovic, Deputy Chairperson of the Council of Ministers and Minister of Foreign Affairs of Bosnia and Herzegovina, to Türkiye 
 https://www.mfa.gov.tr/no_-13_-bosna-hersek-bakanlar-konseyi-baskan-yrd-ve-disisleri-bakani-bisera-turkovic-in-ulkemizi-ziyareti-hk.en.mfa 

...

CodePudding user response:

Using only Selenium you can easily collect all links from the webpage inducing WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4')
    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm > a")))])
    
  • Using XPATH:

    driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4')
    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sub_lstitm']/a")))])
    
  • Console Output:

    ['https://www.mfa.gov.tr/no_-17_-turkiye-ve-yemen-arasinda-gerceklestirilecek-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-16_-sayin-bakanimizin-abd-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-15_-iran-islam-cumhuriyeti-disisleri-bakani-huseyin-emir-abdullahiyan-in-ulkemize-yapacagi-ziyaret-hk.en.mfa', 'https://www.mfa.gov.tr/no_-14_-nepal-de-meydana-gelen-ucak-kazasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-13_-bosna-hersek-bakanlar-konseyi-baskan-yrd-ve-disisleri-bakani-bisera-turkovic-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-12_-turkiye-iran-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-11_-kuzey-kibris-turk-cumhuriyeti-kurucu-cumhurbaskani-sayin-rauf-raif-denktas-in-vefatinin-on-birinci-yildonumu-hk.en.mfa', 'https://www.mfa.gov.tr/no_-10_-italya-basbakan-yardimcisi-ve-disisleri-ve-uluslararasi-isbirligi-bakani-antonio-tajani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-9_-kirim-tatar-soydaslarimiz-hakkinda-mahkumiyet-karari-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-8_-kuzeybati-suriye-ye-yonelik-bm-sinir-otesi-insani-yardim-mekanizmasinin-uzatilmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-7_-brezilya-da-devlet-baskani-lula-da-silva-hukumeti-ni-ve-demokratik-kurumlari-hedef-alan-siddet-olaylari-hk.en.mfa', 'https://www.mfa.gov.tr/no_-6_-sudan-daki-gelismeler-hk.en.mfa', 'https://www.mfa.gov.tr/no_-5_-senegal-in-gniby-kentinde-meydana-gelen-kaza-hk.en.mfa', 'https://www.mfa.gov.tr/no_-4_-sayin-bakanimizin-afrika-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-3_-deas-teror-orgutu-ile-iltisakli-bir-sebekenin-malvarliklarinin-abd-makamlari-ile-eszamanli-olarak-dondurulmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-2_-somali-de-meydana-gelen-teror-saldirisi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-1_-israil-ulusal-guvenlik-bakani-itamar-ben-gvir-in-mescid-i-aksa-ya-baskini--hk.en.mfa', 'https://www.mfa.gov.tr/no_-386_-sayin-bakanimizin-brezilya-yi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/sc_-32_-gkry-nin-dogu-akdeniz-de-devam-eden-hidrokarbon-faaliyetleri-hk-sc.en.mfa', 'https://www.mfa.gov.tr/no_-385_-afganistan-da-yuksekogretimde-kiz-ogrencilere-getirilen-egitim-yasagi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-384_-isvec-disisleri-bakani-tobias-billstrom-un-turkiye-yi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-383_-yemen-cumhuriyeti-disisleri-ve-yurtdisindaki-yemenliler-bakani-dr-ahmed-awad-binmubarak-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-382_-gambiya-disisleri-uluslararasi-isbirligi-ve-yurtdisindaki-gambiyalilar-bakani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-381_-bosna-hersek-e-ab-adaylik-statusu-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-380_-turkiye-meksika-ust-duzey-iki-uluslu-komisyonu-siyasi-komitesinin-ikinci-toplantisinin-duzenlenmesi-hk.en.mfa']
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Related