Can't get all the necessary links from web-page via Selenium

Time:11-11

I'm trying to automate part of a patent search. I'd like to get all the links in the results of a search query; specifically, I'm interested in Apple patents from 2015 onward. This is my code:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By

new_driver_path = r"C:/Users/alexe/Desktop/Apple/PatentSearch/geckodriver-v0.30.0-win64/geckodriver.exe"

ops = Options()
serv = Service(new_driver_path)
browser1 = webdriver.Firefox(service=serv, options=ops)
browser1.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")

elements = browser1.find_elements(By.CLASS_NAME, "search-result-item")

links = []
for elem in elements:
    href = elem.get_attribute('href')
    if href:
        links.append(href)

links = set(links)
for href in links:
    print(href)

The output is:

https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf

The problem is that one link is missing:

[Screenshot: result item and the missing link]

I've tried different selectors and still get the same result: one link is missing. I've also tried searching with different parameters, and the pattern is always the same: the missing links are the ones that have no PDF output. I've spent a lot of time trying to figure out the reason, so I would be really grateful if you could give me any clue. Thanks in advance!

CodePudding user response:

The highlighted result has no <a> tag with the class pdfLink in it. Put the line that extracts the link in a try block; if the required element is not found, fall back to whichever <a> tag is available for that article.

Try something like this:

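from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# driver is an existing WebDriver instance (browser1 in the question).
driver.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")

# The find_element_by_* helpers were removed in Selenium 4.3; use find_element(By...) instead.
articles = driver.find_elements(By.TAG_NAME, "article")

print(len(articles))

for article in articles:
    try:
        # The leading dot scopes the XPath to this article rather than the whole page.
        link = article.find_element(By.XPATH, ".//a[contains(@class,'pdfLink')]").get_attribute("href")
        print(link)
    except NoSuchElementException:
        # No PDF link in this result: fall back to the first <a> tag in the article.
        print("Exception")
        link = article.find_element(By.XPATH, ".//a").get_attribute("href")
        print(link)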

Output:

10
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
Exception
https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new#
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf
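
Note that for the result without a PDF, the fallback above prints the search page URL with a trailing "#", which is not a usable link. As a possible variation (a minimal sketch, not the only way to do it), find_elements can be used instead of find_element: it returns an empty list when nothing matches, so no try/except is needed and a missing PDF can be recorded as None. The XPath is the one from the answer above; FakeLink and FakeArticle are hypothetical stand-ins so the sketch runs without a browser.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH


def pdf_links(articles):
    """Return one href (or None) per article, in page order."""
    links = []
    for article in articles:
        # find_elements returns [] when nothing matches instead of raising.
        matches = article.find_elements(XPATH, ".//a[contains(@class,'pdfLink')]")
        links.append(matches[0].get_attribute("href") if matches else None)
    return links


# Hypothetical stand-ins for WebElement objects, for illustration only.
class FakeLink:
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


class FakeArticle:
    def __init__(self, href=None):
        self._links = [FakeLink(href)] if href else []

    def find_elements(self, by, value):
        return list(self._links)


demo = [
    FakeArticle("https://example.com/a.pdf"),
    FakeArticle(),  # a result with no PDF link
    FakeArticle("https://example.com/b.pdf"),
]
result = pdf_links(demo)
print(result)
```

With a real browser session, the same function should accept the list returned by driver.find_elements(By.TAG_NAME, "article").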

CodePudding user response:

To extract the href attributes of all the PDF links using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.search-result-item[href]")))])
    
  • Using XPATH:

    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'search-result-item') and @href]")))])
    
  • Console Output:

    ['https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf', 'https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf', 'https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf', 'https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf', 'https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf', 'https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf', 'https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf', 'https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf', 'https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

PS: You can extract only nine (9) href attributes, because one of the search items is a <span> element rather than a link, i.e. it has no href attribute.
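
If the located elements may include items without an href (such as the <span> mentioned above), the values can be filtered and de-duplicated in one step. The sketch below is only an illustration: unique_hrefs and FakeElement are hypothetical names, and FakeElement stands in for a WebElement so the code runs without a browser. It uses dict.fromkeys rather than set(), so the links stay in page order.

```python
def unique_hrefs(elements):
    """Return the non-empty href values of the elements, de-duplicated in page order."""
    hrefs = (el.get_attribute("href") for el in elements)
    # dict.fromkeys keeps first-seen order, unlike set().
    return list(dict.fromkeys(h for h in hrefs if h))


class FakeElement:
    """Hypothetical stand-in for a WebElement, for illustration only."""

    def __init__(self, href=None):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


items = [
    FakeElement("https://example.com/1.pdf"),
    FakeElement(),  # e.g. a <span> with no href
    FakeElement("https://example.com/2.pdf"),
    FakeElement("https://example.com/1.pdf"),  # duplicate
]
hrefs = unique_hrefs(items)
print(hrefs)
```

With a real driver, the list returned by the WebDriverWait call above could be passed in place of items.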
