Beautifulsoup scraping "lazy faded" images

Time:05-07

I am looking for a way to parse the images on a web page. Many posts on the subject already exist, and I was inspired by several of them, in particular: How Can I Download An Image From A Website In Python

The script presented in that post works very well, but I have encountered a type of image whose saving I haven't managed to automate. On the website, inspecting the web page gives me:

<img  data-src="Uploads/Media/20220315/1582689.jpg" src="Uploads/Media/20220315/1582689.jpg">

And when I parse the page with BeautifulSoup4, I get this (content from the fonts.gstatic.com entry in the Sources section):

<a  data-size="838x1047" href="Uploads/Media/20220315/1582689.jpg" itemprop="contentUrl">
    <img  data-src="Uploads/Media/20220315/1582689.jpg" />
</a>

The given URL is not a full web URL that can be used to download the image from anywhere, but a link into the "Sources" panel of the web page (Ctrl+Shift+I on the webpage), where the image is.

When I hover over the src link in the source code of the website, I can see the true full URL under "Current source". This information is located in the Elements/Properties panel of the DevTools (Ctrl+Shift+I on the webpage), but I don't know how to automate the saving of the images, either by directly using the link to access the web page sources, or by accessing the full address to download the images. Do you have any ideas?

PS: I found this article about lazy-loading faded images, but my HTML knowledge isn't enough to derive a solution for my problem: https://davidwalsh.name/lazyload-image-fade

Thank you, Have a nice weekend.

CodePudding user response:

I'm not too familiar with web scraping or its benefits. However, I found an article that you can reference, and I hope it helps!

However, here is the code and everything you need in one place.

First you have to find the webpage you want to download the images from, which is your decision.

Now we have to get the URLs of the images: create an empty list, open the page, select the anchor elements, loop through them, and append each link to the list.

import urllib.request
from bs4 import BeautifulSoup

url = ""
link_list = []
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response, "html.parser")
image_list = soup.select('div.boxmeta.clearfix > h2 > a')
for image_link in image_list:
    link_url = image_link.attrs['href']
    link_list.append(link_url)

This should read the href attribute of every matching anchor and append it to the list.

Now we have to get the img tags from each linked page.

for page_url in link_list:
    page_html = urllib.request.urlopen(page_url)
    page_soup = BeautifulSoup(page_html, "html.parser")
    img_list = page_soup.select('div.seperator > a > img')

This finds, on each page, the div tags with class seperator (spelled that way on the site), then looks for an a tag and the img tag inside it.

import re
from pathlib import Path

output_folder = Path("images")  # destination folder for the downloads
output_folder.mkdir(exist_ok=True)

for img in img_list:
    img_url = img.attrs['src']
    file_name = re.search(r".*/(.*png|.*jpg)$", img_url)
    save_path = output_folder.joinpath(file_name.group(1))

Now we are going to download that data inside a try/except block.

import requests

try:
    image = requests.get(img_url)
    with open(save_path, 'wb') as f:
        f.write(image.content)
    print(save_path)
except requests.exceptions.RequestException as e:
    print("Download failed:", e)
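As a side note, the CSS selectors used above can be exercised offline on a small HTML fragment before running against the live site; this is a quick way to confirm a selector matches what you expect. The HTML below is a made-up example shaped to match the first selector, and extract_hrefs is a hypothetical helper written for this sketch:

```python
from bs4 import BeautifulSoup

def extract_hrefs(html: str, selector: str) -> list:
    """Return the href attribute of every element matched by selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.attrs['href'] for a in soup.select(selector)]

# Made-up HTML mimicking the structure the selector above targets
sample = """
<div class="boxmeta clearfix">
  <h2><a href="page1.html">First</a></h2>
</div>
<div class="boxmeta clearfix">
  <h2><a href="page2.html">Second</a></h2>
</div>
"""

print(extract_hrefs(sample, 'div.boxmeta.clearfix > h2 > a'))
# ['page1.html', 'page2.html']
```

If this prints an empty list against the real page's HTML, the selector needs adjusting to that site's markup.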

CodePudding user response:

I think you are talking about the relative path and absolute path.

Things like Uploads/Media/20220315/1582689.jpg is a relative path.

The main difference between absolute and relative paths is that absolute URLs always include the domain name of the site with http://www. Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. --- ref.
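For instance, resolving the relative path from the question against a hypothetical base page URL (the example.com URL below is made up) can be sketched with the standard library:

```python
from urllib.parse import urljoin

base = "https://example.com/gallery/photos.html"  # hypothetical page URL
print(urljoin(base, "Uploads/Media/20220315/1582689.jpg"))
# https://example.com/gallery/Uploads/Media/20220315/1582689.jpg
```

The relative path replaces the last segment of the base URL's path, which is exactly what the browser does when it fetches the image.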

So in your case try this to get the list of the images' absolute path:

import requests
from bs4 import BeautifulSoup
from PIL import Image

URL = 'YOUR_URL_HERE'

r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

for img in soup.find_all("img"):

    # Skip img tags without a data-src attribute
    if not img.get('data-src'):
        continue

    # Get the image absolute path url
    absolute_path = requests.compat.urljoin(URL, img.get('data-src'))

    # Download the image
    image = Image.open(requests.get(absolute_path, stream=True).raw)
    image.save(absolute_path.split('/')[-1].split('?')[0])
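One caveat: on lazy-loading pages some img tags may carry only src and others only data-src. A small sketch of a fallback (best_source is a hypothetical helper for this example, and the page URL is made up; requests.compat.urljoin is the same urljoin used above):

```python
import requests

def best_source(img_attrs: dict, page_url: str):
    """Prefer the lazy-load data-src attribute, falling back to src."""
    raw = img_attrs.get('data-src') or img_attrs.get('src')
    return requests.compat.urljoin(page_url, raw) if raw else None

page = "https://example.com/album/index.html"  # hypothetical page URL
print(best_source({'data-src': 'Uploads/Media/20220315/1582689.jpg'}, page))
# https://example.com/album/Uploads/Media/20220315/1582689.jpg
print(best_source({'src': '/static/logo.png'}, page))
# https://example.com/static/logo.png
```

An img with neither attribute yields None, so the caller can simply skip it.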