I need to collect all links from a webpage as seen below (25 links from each 206 pages, around 5200 total links), which also has a load more news button (as three dots). I wrote my script, but my script does not give any links that I tried to collect. I updated some of Selenium attributes. I really don't know why I could not get all the links.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
#Initialize the Chrome driver
driver = webdriver.Chrome()
driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")
page_count = driver.find_element(By.XPATH, "//span[@class='rgInfoPart']")
text = page_count.text
page_count = int(text.split()[-1])
links = []
for i in range(1, page_count 1):
# Click on the page number
driver.find_element(By.XPATH, f"//a[text()='{i}']").click()
time.sleep(5)
# Wait for the page to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Extract the links from the page
page_links = soup.find_all('div', {'class': 'sub_lstitm'})
for link in page_links:
links.append("https://www.mfa.gov.tr" link.find('a')['href'])
time.sleep(5)
driver.quit()
print(links)
I tried to run my code but actually I couldn't. I need to have some solution for this.
CodePudding user response:
You can easily do everything in Selenium using the following method:
- Wait for the links to be visible on the page
- Get titles and urls
- Get the current page number
- If there is the button for the next page, then click it and repeat from step 1., otherwise it means we are in the last page hence the execution ends
Add these imports
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
and then run the following code
driver.get("https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4")
titles, urls = [], []
while 1:
links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm a")))
for link in links:
titles.append( link.text )
urls.append( link.get_attribute('href') )
current_page = int(driver.find_element(By.CSS_SELECTOR, 'td span').text)
print('current page:', current_page, end='\r')
try:
next_page_button = driver.find_element(By.XPATH, f'//a[text()={current_page 1}]')
except:
print('current page is the last one')
break
next_page_button.click()
for i in range(len(titles)):
print(titles[i],'\n',urls[i],'\n')
Output
No: 17, 18 January 2023, Press Release Regarding the Consular Consultations Between Türkiye and Yemen
https://www.mfa.gov.tr/no_-17_-turkiye-ve-yemen-arasinda-gerceklestirilecek-konsolosluk-istisareleri-hk.en.mfa
No: 16, 17 January 2023, Press Release Regarding the Visit of H.E. Mr. Mevlüt Çavuşoğlu, Minister of Foreign Affairs of the Republic of Türkiye, to the U.S.
https://www.mfa.gov.tr/no_-16_-sayin-bakanimizin-abd-ziyareti-hk.en.mfa
No: 15, 16 January 2023, Press Release Regarding the Visit of H.E. Mr. Hossein Amir Abdollahian, Minister of Foreign Affairs of the Islamic Republic of Iran, to Türkiye
https://www.mfa.gov.tr/no_-15_-iran-islam-cumhuriyeti-disisleri-bakani-huseyin-emir-abdullahiyan-in-ulkemize-yapacagi-ziyaret-hk.en.mfa
No: 14, 15 January 2023, Press Release Regarding the Plane Crash in Nepal
https://www.mfa.gov.tr/no_-14_-nepal-de-meydana-gelen-ucak-kazasi-hk.en.mfa
No: 13, 14 January 2023, Press Release Regarding the Visit of H.E. Ms. Bisera Turkovic, Deputy Chairperson of the Council of Ministers and Minister of Foreign Affairs of Bosnia and Herzegovina, to Türkiye
https://www.mfa.gov.tr/no_-13_-bosna-hersek-bakanlar-konseyi-baskan-yrd-ve-disisleri-bakani-bisera-turkovic-in-ulkemizi-ziyareti-hk.en.mfa
...
CodePudding user response:
Using only Selenium you can easily collect all links from the webpage inducing WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4') print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sub_lstitm > a")))])
Using XPATH:
driver.get('https://www.mfa.gov.tr/sub.en.mfa?ad9093da-8e71-4678-a1b6-05f297baadc4') print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sub_lstitm']/a")))])
Console Output:
['https://www.mfa.gov.tr/no_-17_-turkiye-ve-yemen-arasinda-gerceklestirilecek-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-16_-sayin-bakanimizin-abd-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-15_-iran-islam-cumhuriyeti-disisleri-bakani-huseyin-emir-abdullahiyan-in-ulkemize-yapacagi-ziyaret-hk.en.mfa', 'https://www.mfa.gov.tr/no_-14_-nepal-de-meydana-gelen-ucak-kazasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-13_-bosna-hersek-bakanlar-konseyi-baskan-yrd-ve-disisleri-bakani-bisera-turkovic-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-12_-turkiye-iran-konsolosluk-istisareleri-hk.en.mfa', 'https://www.mfa.gov.tr/no_-11_-kuzey-kibris-turk-cumhuriyeti-kurucu-cumhurbaskani-sayin-rauf-raif-denktas-in-vefatinin-on-birinci-yildonumu-hk.en.mfa', 'https://www.mfa.gov.tr/no_-10_-italya-basbakan-yardimcisi-ve-disisleri-ve-uluslararasi-isbirligi-bakani-antonio-tajani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-9_-kirim-tatar-soydaslarimiz-hakkinda-mahkumiyet-karari-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-8_-kuzeybati-suriye-ye-yonelik-bm-sinir-otesi-insani-yardim-mekanizmasinin-uzatilmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-7_-brezilya-da-devlet-baskani-lula-da-silva-hukumeti-ni-ve-demokratik-kurumlari-hedef-alan-siddet-olaylari-hk.en.mfa', 'https://www.mfa.gov.tr/no_-6_-sudan-daki-gelismeler-hk.en.mfa', 'https://www.mfa.gov.tr/no_-5_-senegal-in-gniby-kentinde-meydana-gelen-kaza-hk.en.mfa', 'https://www.mfa.gov.tr/no_-4_-sayin-bakanimizin-afrika-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-3_-deas-teror-orgutu-ile-iltisakli-bir-sebekenin-malvarliklarinin-abd-makamlari-ile-eszamanli-olarak-dondurulmasi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-2_-somali-de-meydana-gelen-teror-saldirisi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-1_-israil-ulusal-guvenlik-bakani-itamar-ben-gvir-in-mescid-i-aksa-ya-baskini--hk.en.mfa', 'https://www.mfa.gov.tr/no_-386_-sayin-bakanimizin-brezilya-yi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/sc_-32_-gkry-nin-dogu-akdeniz-de-devam-eden-hidrokarbon-faaliyetleri-hk-sc.en.mfa', 'https://www.mfa.gov.tr/no_-385_-afganistan-da-yuksekogretimde-kiz-ogrencilere-getirilen-egitim-yasagi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-384_-isvec-disisleri-bakani-tobias-billstrom-un-turkiye-yi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-383_-yemen-cumhuriyeti-disisleri-ve-yurtdisindaki-yemenliler-bakani-dr-ahmed-awad-binmubarak-in-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-382_-gambiya-disisleri-uluslararasi-isbirligi-ve-yurtdisindaki-gambiyalilar-bakani-nin-ulkemizi-ziyareti-hk.en.mfa', 'https://www.mfa.gov.tr/no_-381_-bosna-hersek-e-ab-adaylik-statusu-verilmesi-hk.en.mfa', 'https://www.mfa.gov.tr/no_-380_-turkiye-meksika-ust-duzey-iki-uluslu-komisyonu-siyasi-komitesinin-ikinci-toplantisinin-duzenlenmesi-hk.en.mfa']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC