I am very new to web scraping and am trying to scrape GIF URLs from a website. For example, on gifer.com I want to search for "smile" GIFs and then collect the URLs of all the GIFs listed. Below is an example of the page source, from which I want to extract the src attribute of the video's source element (https://i.gifer.com/ON0.mp4 in this case).
<div >
<div >
<div >
<span style="color: rgb(255, 255, 255); font-size: 44px;"></span>
</div>
<div style="width: 367.462px;">
<div style="padding-top: 122.462%;">
<div >
<div style="width: 367.462px;">
<div>
<video poster="https://i.gifer.com/fetch/w300-preview/d0/d0e6e89a42c43d31b5913e232d87af7b.gif" loop="" autoplay="" playsinline="">
<source src="https://i.gifer.com/ON0.mp4" type="video/mp4">
</video>
</div>
</div>
</div>
</div>
</div>
<div >
<span style="color: rgb(255, 255, 255); font-size: 44px;">
</span>
</div>
</div>
</div>
There are thousands of such results, and I was advised to use Python and Selenium. However, my knowledge of both is limited. I tried the code below but have not made much headway.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
# Replaces options.headless = True, which newer Selenium releases no longer support
options.add_argument("--headless")
options.add_argument("--window-size=1920,1200")
# Pass the options to the driver; otherwise they have no effect
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://gifer.com/en/gifs/smile")
imgResults = driver.find_elements(By.CLASS_NAME, "media-container2")
print(len(imgResults))
#print(driver.page_source)
for result in imgResults:
    print(result)
driver.quit()
The above returns 4 elements:
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="16e771ca-37d8-45a0-8200-0f03da0b7d14")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="8c9abdcb-bc9d-47da-9958-109e722b3ae9")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="d9640144-4ba1-414b-aa4f-5141387335ef")>
<selenium.webdriver.remote.webelement.WebElement (session="fac424650675a90b2a8dee91efdc01f4", element="9626db84-1da9-42ad-b314-56222a5e933b")>
What I cannot work out is how to grab the src link of the source element for each video.
CodePudding user response:
I was wrong: there is no need to load a new page to get the mp4 link. The last path segment of each result link's href is the code used in the mp4 URL:
for img in driver.find_elements(By.CSS_SELECTOR, "figure a"):
    # The last path segment of the href is the GIF's code
    code = img.get_attribute('href').split('/')[-1]
    link = f'https://i.gifer.com/{code}.mp4'
    print(link)
Output:
https://i.gifer.com/fzvh.mp4
https://i.gifer.com/7F5y.mp4
https://i.gifer.com/6qOR.mp4
https://i.gifer.com/3JT.mp4
...
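The split on '/' above takes whatever follows the last slash, so it would yield an empty code if an href ended in a slash and would keep any query string. As a hedged sketch, the same conversion can be made a little more defensive with the standard library. The helper name mp4_url_from_href and the sample hrefs are hypothetical, and the href shape is assumed from the output above rather than confirmed against the site.

```python
from urllib.parse import urlparse

MEDIA_HOST = "https://i.gifer.com"  # host seen in the example output above


def mp4_url_from_href(href):
    """Build the direct .mp4 URL from a result link's href.

    Returns None when the href has no usable path segment.
    """
    if not href:
        return None
    # urlparse separates the path from any query string or fragment
    path = urlparse(href).path
    # Drop a trailing slash so the last segment is never empty
    code = path.rstrip("/").split("/")[-1]
    if not code:
        return None
    return f"{MEDIA_HOST}/{code}.mp4"


# Hypothetical hrefs, shaped like the ones the selector above is assumed to return
print(mp4_url_from_href("https://gifer.com/en/fzvh"))
print(mp4_url_from_href("https://gifer.com/en/7F5y/?ref=search"))
print(mp4_url_from_href(None))
```

In the loop above, link = mp4_url_from_href(img.get_attribute('href')) would replace the two lines that build code and link, with None results skipped.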
You can also build the list of links in one line:
links = [f"https://i.gifer.com/{img.get_attribute('href').split('/')[-1]}.mp4" for img in driver.find_elements(By.CSS_SELECTOR, "figure a")]
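Since the question mentions thousands of results while only a handful of elements are found at first, the page probably loads more results as you scroll. That is an assumption, not something verified here. Under that assumption, a rough sketch is to scroll, re-run the same selector, and accumulate unique links until nothing new appears. The function name collect_mp4_links and the max_scrolls and pause parameters are made up for illustration, and the fixed sleep is a crude stand-in for a proper wait.

```python
import time

CSS = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def collect_mp4_links(driver, max_scrolls=20, pause=1.0):
    """Scroll the page and collect unique .mp4 links in first-seen order."""
    seen = {}  # a dict keeps insertion order and drops duplicates
    for _ in range(max_scrolls):
        before = len(seen)
        for a in driver.find_elements(CSS, "figure a"):
            href = a.get_attribute("href")
            if href:
                # Same rule as the answer: last path segment is the GIF's code
                code = href.rstrip("/").split("/")[-1]
                seen[f"https://i.gifer.com/{code}.mp4"] = None
        if len(seen) == before:
            break  # nothing new appeared after the last scroll
        # Assumption: scrolling to the bottom triggers loading of more results
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    return list(seen)


# Usage with a live session (call before driver.quit()):
# links = collect_mp4_links(driver, max_scrolls=50, pause=2.0)
```

If the page re-renders its list while scrolling, elements found earlier can go stale between find_elements and get_attribute, so a real run may need to catch Selenium's StaleElementReferenceException and retry.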