How to get the value of src attribute from Gibiru image-search via Selenium


I'm working on a web-scraping program to collect the src links of every image returned by an image search on https://gibiru.com/

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver.get("https://gibiru.com/")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').click()
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys("lfc")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys(Keys.RETURN)
# open the image results tab
driver.find_element(By.XPATH, "/html/body/div[1]/main/div[1]/div/div/div/div[2]").click()
test = driver.find_element(By.XPATH, "//*[@id='___gcse_0']/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img")
print(str(test))

These are the paths to the image:

Element / outerHTML:

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMABCfx3q7rIc6AqY0WSu84w22-PUbEnxkEDqmPqTqNYLrqr0&amp;s" title="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" alt="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" >

Selector:

#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img 

JS_path:

document.querySelector("#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img")

Xpath:

//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img

Full_Xpath:

/html/body/div[1]/main/div[2]/div[2]/div/div[1]/div/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img

This is the tag whose src attribute value I want to read. My error says that the test element does not exist.


CodePudding user response:

To print the value of the src attribute you need to induce WebDriverWait for the visibility_of_element_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "img.gs-image.gs-image-scalable[alt^='Diogo Jota vs Tekkz'][title*='YouTube']"))).get_attribute("src"))
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='gs-image gs-image-scalable' and starts-with(@alt, 'Diogo Jota vs Tekkz')][contains(@title, 'YouTube')]"))).get_attribute("src"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in Python Selenium - get href value
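
Putting it together with the steps from the question, a minimal end-to-end sketch could look like the following (the Chrome driver setup is an assumption; the search-box selector and the Images-tab XPath are taken from the question, and the wait-based locator is the CSS_SELECTOR variant above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumption: any configured WebDriver will do

# search for "lfc" on the home page (selector from the question)
search_box = driver.find_element(By.CSS_SELECTOR, ".form-control.has-feedback.has-clear")
search_box.click()
search_box.send_keys("lfc")
search_box.send_keys(Keys.RETURN)

# open the image results (XPath from the question)
driver.find_element(By.XPATH, "/html/body/div[1]/main/div[1]/div/div/div/div[2]").click()

# wait for the thumbnail to become visible, then read its src attribute
img = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "img.gs-image.gs-image-scalable[alt^='Diogo Jota vs Tekkz'][title*='YouTube']")
    )
)
print(img.get_attribute("src"))

To collect the src of every thumbnail on the page rather than a single one, visibility_of_all_elements_located() with a broader selector such as img.gs-image returns a list of elements you can loop over.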

CodePudding user response:

It can be done with requests, since Gibiru pulls its results from Google:

import requests
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}

# hit the Google Custom Search endpoint that Gibiru calls in the background
r = requests.get('https://cse.google.com/cse/element/v1?rsz=20&num=20&hl=en&source=gcsc&gss=.com&cselibv=3e1664f444e6eb06&searchtype=image&cx=partner-pub-5956360965567042:9380749580&q=lfc&safe=off&cse_tok=AB1-RNUhB3siCjwzYPYzrx4PNVWU:1658589428907&exp=csqr,cc&callback=google.search.cse.api5143', headers=headers)

# the response is JSONP: strip the google.search.cse.api5143(...) wrapper,
# parse the JSON and load the 'results' array into a dataframe
df = pd.DataFrame(json.loads(r.text.split('google.search.cse.api5143(')[1].rsplit(');', 1)[0])['results'])
print(df)

This will return a dataframe with 20 rows × 21 columns:

    content contentNoFormatting title   titleNoFormatting   unescapedUrl    url visibleUrl  originalContextUrl  height  width   ... tbMedUrl    tbLargeUrl  tbHeight    tbMedHeight tbLargeHeight   tbWidth tbMedWidth  tbLargeWidth    imageId fileFormat
0           https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau...   https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau...   https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau...   https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau...   www.youtube.com https://www.youtube.com/watch?v=5KlWCboXwLc 720 1280    ... https://encrypted-tbn0.gstatic.com/images?q=tb...   https://encrypted-tbn0.gstatic.com/images?q=tb...   84  121 168 150 215 300 ANd9GcQgLfQK0oljmpqryiMSQ7LxAL-qowEW2AnTHFii-K...   image/jpeg
1   Amazon.com: <b>Liverpool FC</b> Laminated Cres...   Amazon.com: Liverpool FC Laminated Crest LFC M...   Amazon.com: <b>Liverpool FC</b> Laminated Cres...   Amazon.com: Liverpool FC Laminated Crest LFC M...   https://m.media-amazon.com/images/I/71KhvpBQtI...   https://m.media-amazon.com/images/I/71KhvpBQtI...   www.amazon.com  https://www.amazon.com/Liverpool-FC-Laminated-...   1200    803 ... https://encrypted-tbn0.gstatic.com/images?q=tb...   https://encrypted-tbn0.gstatic.com/images?q=tb...   150 171 275 100 114 184 ANd9GcR3gsdZSUPFlVnvcR93GqUBpaARTwrgVmwMRlwg9i...   image/jpeg
2   Highlights: <b>LFC</b> 3-1 CA Osasuna | Firmin...   Highlights: LFC 3-1 CA Osasuna | Firmino score...   Highlights: <b>LFC</b> 3-1 CA Osasuna | Firmin...   Highlights: LFC 3-1 CA Osasuna | Firmino score...   https://i.ytimg.com/vi/P4w1-oVWb3U/maxresdefau...   https://i.ytimg.com/vi/P4w1-oVWb3U/maxresdefau...   www.youtube.com https://www.youtube.com/watch?v=P4w1-oVWb3U 720 1280    ... https://encrypted-tbn0.gstatic.com/images?q=tb...   https://encrypted-tbn0.gstatic.com/images?q=tb...   84  121 168 150 215 300 ANd9GcQ_orvU_9vT-rWsWeGEe-gbEQ_VfjSKKqSiOUlMzm...   image/jpeg
[....]

Inspect the headers of the GET request when switching to page 2, 3, etc., to adapt your code to scrape the rest of the pages as well.
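
As a small illustration, once the dataframe is built the links can be pulled straight out of the columns shown in the output above; the column names (url, originalContextUrl, tbMedUrl) are taken from that output and may need adjusting if Google changes the response schema:

# full-size image URLs plus the pages they were found on
image_links = df['url'].tolist()
context_pages = df['originalContextUrl'].tolist()

# medium thumbnail URLs (encrypted-tbn0.gstatic.com links, like the src value in the question)
thumbnails = df['tbMedUrl'].tolist()

for img_url, page in zip(image_links, context_pages):
    print(img_url, '<-', page)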
