I'm working on a web-scraping program to collect the src links of the images returned by an image search on https://gibiru.com/:
driver.get("https://gibiru.com/")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').click()
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys("lfc")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys(Keys.RETURN)
driver.find_element(By.XPATH, "/html/body/div[1]/main/div[1]/div/div/div/div[2]").click()
test = driver.find_element(By.XPATH, "//*[@id='___gcse_0']/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img")
print(str(test))
These are the paths to the image:
Element:
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMABCfx3q7rIc6AqY0WSu84w22-PUbEnxkEDqmPqTqNYLrqr0&s" title="Diogo Jota vs Tekkz & Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" alt="Diogo Jota vs Tekkz & Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" >
outerHTML:
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMABCfx3q7rIc6AqY0WSu84w22-PUbEnxkEDqmPqTqNYLrqr0&s" title="Diogo Jota vs Tekkz & Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" alt="Diogo Jota vs Tekkz & Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" >
Selector:
#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img
JS_path:
document.querySelector("#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img")
Xpath:
//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img
Full_Xpath:
/html/body/div[1]/main/div[2]/div[2]/div/div[1]/div/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img
This is the tag whose src attribute I want to read. The error message says that the test element does not exist.
CodePudding user response:
To print the value of the src attribute, you need to wait for the element with WebDriverWait and visibility_of_element_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "img.gs-image.gs-image-scalable[alt^='Diogo Jota vs Tekkz'][title*='YouTube']"))).get_attribute("src"))
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='gs-image gs-image-scalable' and starts-with(@alt, 'Diogo Jota vs Tekkz')][contains(@title, 'YouTube')]"))).get_attribute("src"))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in Python Selenium - get href value
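Since the question asks for the src of every image rather than one specific result, the same wait can be applied to all thumbnails at once. The following is only a sketch: the function names collect_image_srcs and unique_srcs are made up for illustration, and the selector img.gs-image assumes that every thumbnail carries the gs-image class used in the locators above.

```python
def unique_srcs(elements):
    """Return the non-empty src values of the elements, in order, without duplicates."""
    seen = set()
    srcs = []
    for element in elements:
        src = element.get_attribute("src")
        if src and src not in seen:
            seen.add(src)
            srcs.append(src)
    return srcs


def collect_image_srcs(driver, timeout=20):
    """Wait for the result thumbnails to be present, then return their src values."""
    # Imported inside the function so unique_srcs() stays usable without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: every thumbnail has the gs-image class seen in the locators above.
    images = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.gs-image"))
    )
    return unique_srcs(images)
```

After the search has been submitted and the images tab has been clicked, print(collect_image_srcs(driver)) would print the collected links.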
CodePudding user response:
This can also be done with requests, because Gibiru pulls its results from Google:
import requests
import pandas as pd
import json
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://cse.google.com/cse/element/v1?rsz=20&num=20&hl=en&source=gcsc&gss=.com&cselibv=3e1664f444e6eb06&searchtype=image&cx=partner-pub-5956360965567042:9380749580&q=lfc&safe=off&cse_tok=AB1-RNUhB3siCjwzYPYzrx4PNVWU:1658589428907&exp=csqr,cc&callback=google.search.cse.api5143', headers=headers)
df = pd.DataFrame(json.loads(r.text.split('google.search.cse.api5143(')[1].rsplit(');', 1)[0])['results'])
print(df)
This will return a DataFrame with 20 rows × 21 columns:
content contentNoFormatting title titleNoFormatting unescapedUrl url visibleUrl originalContextUrl height width ... tbMedUrl tbLargeUrl tbHeight tbMedHeight tbLargeHeight tbWidth tbMedWidth tbLargeWidth imageId fileFormat
0 https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau... https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau... https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau... https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefau... www.youtube.com https://www.youtube.com/watch?v=5KlWCboXwLc 720 1280 ... https://encrypted-tbn0.gstatic.com/images?q=tb... https://encrypted-tbn0.gstatic.com/images?q=tb... 84 121 168 150 215 300 ANd9GcQgLfQK0oljmpqryiMSQ7LxAL-qowEW2AnTHFii-K... image/jpeg
1 Amazon.com: <b>Liverpool FC</b> Laminated Cres... Amazon.com: Liverpool FC Laminated Crest LFC M... Amazon.com: <b>Liverpool FC</b> Laminated Cres... Amazon.com: Liverpool FC Laminated Crest LFC M... https://m.media-amazon.com/images/I/71KhvpBQtI... https://m.media-amazon.com/images/I/71KhvpBQtI... www.amazon.com https://www.amazon.com/Liverpool-FC-Laminated-... 1200 803 ... https://encrypted-tbn0.gstatic.com/images?q=tb... https://encrypted-tbn0.gstatic.com/images?q=tb... 150 171 275 100 114 184 ANd9GcR3gsdZSUPFlVnvcR93GqUBpaARTwrgVmwMRlwg9i... image/jpeg
2 Highlights: <b>LFC</b> 3-1 CA Osasuna | Firmin... Highlights: LFC 3-1 CA Osasuna | Firmino score... Highlights: <b>LFC</b> 3-1 CA Osasuna | Firmin... Highlights: LFC 3-1 CA Osasuna | Firmino score... https://i.ytimg.com/vi/P4w1-oVWb3U/maxresdefau... https://i.ytimg.com/vi/P4w1-oVWb3U/maxresdefau... www.youtube.com https://www.youtube.com/watch?v=P4w1-oVWb3U 720 1280 ... https://encrypted-tbn0.gstatic.com/images?q=tb... https://encrypted-tbn0.gstatic.com/images?q=tb... 84 121 168 150 215 300 ANd9GcQ_orvU_9vT-rWsWeGEe-gbEQ_VfjSKKqSiOUlMzm... image/jpeg
[....]
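The split()/rsplit() call in the code above depends on the exact callback name in the URL. A slightly more tolerant way to unwrap the response is sketched below. It assumes only that the body has the shape name(...); and that the url field listed among the columns holds the image link. The helper name unwrap_jsonp and the sample payload are invented for illustration.

```python
import json
import re

# Skip everything up to the first "(", then capture up to the final ")".
_JSONP_RE = re.compile(r"^[^(]*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Strip a JSONP callback wrapper and return the decoded JSON payload."""
    match = _JSONP_RE.match(text.strip())
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Made-up sample in the same shape as the real response (not real data).
sample = 'google.search.cse.api5143({"results": [{"url": "https://example.com/a.jpg"}]});'
payload = unwrap_jsonp(sample)
urls = [item["url"] for item in payload["results"]]
print(urls)
```

With the real response, unwrap_jsonp(r.text)["results"] could be passed to pd.DataFrame() in place of the split()-based expression.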
Inspect the headers of the GET request when switching to page 2, 3, etc., to adapt your code to scrape the rest of the pages as well.
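Once the changing part of the request is known, the page URLs can be generated instead of copied by hand. The sketch below rebuilds a URL with a different value for one query parameter. The parameter name start and the step of 20 are assumptions, not taken from the request above, so confirm the real name and values in the browser's network tab first. The helper name with_query_param is hypothetical, and base_url is a shortened version of the URL above, used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, name, value):
    """Return url with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    # Keep every existing parameter except the one being replaced.
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened URL for illustration; the real one carries more parameters.
base_url = "https://cse.google.com/cse/element/v1?rsz=20&num=20&searchtype=image&q=lfc"

# Assumption: results are paged through an offset parameter called "start".
page_urls = [with_query_param(base_url, "start", offset) for offset in range(0, 60, 20)]
for page_url in page_urls:
    print(page_url)
```

Each generated URL can then be fetched with requests.get() in the same way as the first page.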