Selenium and Python to scrape a library search engine

Time:07-26

I am trying to scrape the title, link and abstract of articles from a search engine on a website we own. I first tried to do this with Google Sheets, but as this is a dynamic website I was encouraged to try Selenium and Python instead. However, I am getting nowhere. I am trying to scrape content from https://resources.norrag.org/categories/591,595 and wish to return the titles and links of two case studies.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    s = Service('C:/Users/xxxx/Downloads/chromedriver_win32/chromedriver.exe')
    driver = webdriver.Chrome(service=s)
    url = 'https://resources.norrag.org/categories/591,595'
    driver.get(url)

    element = driver.find_element("xpath", '//div[@id="article_search_results"]//a')

    print(element)
    driver.close()

Here is the error message:

    ---------------------------------------------------------------------------
    NoSuchElementException                    Traceback (most recent call last)
    Input In [8], in <cell line: 10>()
          6 url='https://resources.norrag.org/categories/591,595'
          7 driver.get(url)
    ---> 10 element = driver.find_element("xpath", '//div[@id="article_search_results"]//a')
         12 print(element)
         13 driver.close()

    File ~\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py:857, in WebDriver.find_element(self, by, value)
        854     by = By.CSS_SELECTOR
        855     value = '[name="%s"]' % value
    --> 857 return self.execute(Command.FIND_ELEMENT, {
        858     'using': by,
        859     'value': value})['value']

    File ~\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py:435, in WebDriver.execute(self, driver_command, params)
        433 response = self.command_executor.execute(driver_command, params)
        434 if response:
    --> 435     self.error_handler.check_response(response)
        436     response['value'] = self._unwrap_value(
        437         response.get('value', None))
        438     return response

    File ~\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py:247, in ErrorHandler.check_response(self, response)
        245         alert_text = value['alert'].get('text')
        246     raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
    --> 247 raise exception_class(message, screen, stacktrace)

    NoSuchElementException: Message: no such element: Unable to locate element:
    {"method":"xpath","selector":"//div[@id="article_search_results"]//a"}
    (Session info: chrome=103.0.5060.114)
    Stacktrace:
    Backtrace:
        Ordinal0 [0x00575FD3 2187219]
        Ordinal0 [0x0050E6D1 1763025]
        Ordinal0 [0x00423E78 802424]
        Ordinal0 [0x00451C10 990224]
        Ordinal0 [0x00451EAB 990891]
        Ordinal0 [0x0047EC92 1174674]
        Ordinal0 [0x0046CBD4 1100756]
        Ordinal0 [0x0047CFC2 1167298]
        Ordinal0 [0x0046C9A6 1100198]
        Ordinal0 [0x00446F80 946048]
        Ordinal0 [0x00447E76 949878]
        GetHandleVerifier [0x008190C2 2721218]
        GetHandleVerifier [0x0080AAF0 2662384]
        GetHandleVerifier [0x0060137A 526458]
        GetHandleVerifier [0x00600416 522518]
        Ordinal0 [0x00514EAB 1789611]
        Ordinal0 [0x005197A8 1808296]
        Ordinal0 [0x00519895 1808533]
        Ordinal0 [0x005226C1 1844929]
        BaseThreadInitThunk [0x76B5FA29 25]
        RtlGetAppContainerNamedObjectPath [0x77007A9E 286]
        RtlGetAppContainerNamedObjectPath [0x77007A6E 238]

CodePudding user response:

Looking at the page source with the browser's inspection tool, you can see that the two links have the class `library-document-summary`. So searching for these elements and returning their text and `href` attribute should work:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    s = Service('C:/Users/xxxx/Downloads/chromedriver_win32/chromedriver.exe')
    driver = webdriver.Chrome(service=s)
    url = 'https://resources.norrag.org/categories/591,595'
    driver.get(url)

    elements = driver.find_elements(By.XPATH, '//a[@class="library-document-summary"]')

    for e in elements:
        print(e.get_attribute("href"))
        print(e.text)

which yields:

    https://resources.norrag.org/resource/696/towards-better-skills-development-in-the-vietnam-2018-general-education-curriculum
    Towards Better Skills Development in the Vietnam 2018 General Education Curriculum
    https://resources.norrag.org/resource/577/vietnam-national-education-for-all-efa-action-plan-2003-2015
    Vietnam National Education for All (EFA) Action Plan 2003 - 2015
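As a side note, you can sanity-check the class-based selection offline with only the standard library before pointing Selenium at the live page. The snippet below runs the same kind of attribute predicate against a minimal stand-in for the result list; the wrapping `<div>` markup is my assumption, not the site's real HTML, and only the class name, titles and URLs come from the output above:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the page's result list: the surrounding markup is an
# assumption; the class name, titles and URLs are taken from the real results.
SAMPLE = """\
<div id="article_search_results">
  <a class="library-document-summary"
     href="https://resources.norrag.org/resource/696/towards-better-skills-development-in-the-vietnam-2018-general-education-curriculum">Towards Better Skills Development in the Vietnam 2018 General Education Curriculum</a>
  <a class="library-document-summary"
     href="https://resources.norrag.org/resource/577/vietnam-national-education-for-all-efa-action-plan-2003-2015">Vietnam National Education for All (EFA) Action Plan 2003 - 2015</a>
  <a class="nav-link" href="https://resources.norrag.org/categories">All categories</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# Same selection idea as the Selenium XPath: keep only the anchors that carry
# the library-document-summary class, then read href and text from each.
links = [
    (a.get("href"), a.text)
    for a in root.findall(".//a[@class='library-document-summary']")
]

for href, text in links:
    print(href)
    print(text)
```

Note that ElementTree's XPath subset only matches an exact `class` attribute value; Selenium's full XPath engine additionally supports `contains(@class, ...)` if the real anchors carry more than one class.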