How to always scrape the first Link that pops up (no image links)

Time:12-02

for i in range(1,len(companynameslist)):
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[i + 1])
    driver.get("https://google.com")
    driver.minimize_window()
    googlebutton = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]')
    googlebutton.click()
    linkedinsearch = 'site:www.linkedin.com “{}”'.format(companynameslist2[i])
    search = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')
    search.click()
    search.send_keys(linkedinsearch)
    search.send_keys(Keys.ENTER)
    currenturl = driver.current_url
    source2 = requests.get(currenturl).text
    soup2 = BeautifulSoup(source2, 'lxml')
    links = driver.find_element_by_xpath('//*[@id="rso"]/div[1]/div/div/div[1]/div/a').click()
    print(driver.current_url)

Hey guys, this program should scrape the LinkedIn company page found through a Google search. I thought this would be the best way to do it (without having to log into LinkedIn), but the problem is that the XPath I used is sometimes invalid when the Google search shows images before the links. How can I skip these images and only scrape the company page?

Any help much appreciated!!!

CodePudding user response:

You need to improve your locators. Absolute XPaths are extremely brittle.
I tested the following code on several company names and it worked correctly.

from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument('--disable-notifications')

webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')  # raw string keeps the backslashes literal
driver = webdriver.Chrome(options=options, service=webdriver_service)
wait = WebDriverWait(driver, 10)

url = "https://google.com"
driver.get(url)


wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[name='q']"))).send_keys("Microsoft" + Keys.ENTER)
link = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#search a[href]"))).get_attribute("href")
print(link)

The output is

https://www.microsoft.com/

So all of your code effectively reduces to two lines here.
Again, I tested it on several company names. If there are cases where it doesn't work, please let me know and I'll look for a more general solution.
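For the original use case (a list of company names and the `site:www.linkedin.com` query), the same locator could be combined with a small filter that skips any result whose host is not LinkedIn, which should also skip image thumbnails that point back to Google. This is only a rough sketch, not tested against live Google results: `build_query`, `first_linkedin_link` and `scrape_company_pages` are made-up helper names, the query uses straight quotes, and it assumes no consent dialog is blocking the search box.

```python
from urllib.parse import urlparse


def build_query(company):
    """Build a site-restricted Google query like the one in the question."""
    return 'site:www.linkedin.com "{}"'.format(company)


def first_linkedin_link(hrefs):
    """Return the first href hosted on linkedin.com (or a subdomain), else None."""
    for href in hrefs:
        if not href:
            continue  # skip anchors with an empty or missing href
        host = urlparse(href).netloc.lower()
        if host == "linkedin.com" or host.endswith(".linkedin.com"):
            return href
    return None


def scrape_company_pages(driver, companies, timeout=10):
    """Search Google for each company and collect the first LinkedIn link found."""
    # Selenium is imported here so the two helpers above work without it installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    results = {}
    for company in companies:
        driver.get("https://google.com")
        box = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[name='q']")))
        box.send_keys(build_query(company) + Keys.ENTER)
        # Wait until at least one result link is visible, then read all hrefs.
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#search a[href]")))
        anchors = driver.find_elements(By.CSS_SELECTOR, "#search a[href]")
        hrefs = [a.get_attribute("href") for a in anchors]
        results[company] = first_linkedin_link(hrefs)
    return results
```

With the `driver` created as in the answer above, `scrape_company_pages(driver, companynameslist)` would return a dict mapping each company name to a link, or to `None` when no LinkedIn result was found. Reusing a single window this way also avoids opening a new tab per company.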
