Web scraping: images are saved in an unsupported format

I have been trying to scrape some images using BeautifulSoup in Python and have run into a problem: I can scrape the image links and store files in a folder, but the saved images are in an unsupported format.


import os

import requests
from bs4 import BeautifulSoup

res = requests.get('https://books.toscrape.com/')
res.raise_for_status()
with open('op.html', 'wb') as file:
    for i in res.iter_content(10000):
        file.write(i)


os.makedirs('images', exist_ok=True)
newfile=open("op.html",'rb')
data=newfile.read()
soup=BeautifulSoup(data,'html.parser')
for link in soup.find_all('img'):
    ll=link.get('src')

    ima = open(os.path.join('images', os.path.basename(ll)), 'wb')
    for down in res.iter_content(1000):
        ima.write(down)

The image viewer says the file format is not supported, even though the files written to the output folder are named as JPEG images.

CodePudding user response：

The loop over res.iter_content(1000) does not read the image at ll; it re-reads the HTML response. Your OS may identify the file as a JPEG from its extension, but that reflects only the filename, not the content, which is HTML rather than JPEG. Hence the error.
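
One way to confirm this is to look at the first bytes of a saved file instead of its extension. The following is a minimal sketch, not a full format detector: it only checks for the JPEG start-of-image signature (FF D8 FF) and for a leading HTML doctype or html tag. The helper names looks_like_jpeg, looks_like_html and sniff_file are made up for illustration.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """True if the bytes start with the JPEG start-of-image signature."""
    return data.startswith(b'\xff\xd8\xff')


def looks_like_html(data: bytes) -> bool:
    """Rough heuristic: the content begins with an HTML doctype or <html> tag."""
    head = data.lstrip().lower()
    return head.startswith((b'<!doctype html', b'<html'))


def sniff_file(path: str) -> str:
    """Classify a saved file by its leading bytes, ignoring its extension."""
    with open(path, 'rb') as f:
        head = f.read(512)
    if looks_like_jpeg(head):
        return 'jpeg'
    if looks_like_html(head):
        return 'html'
    return 'unknown'
```

Calling sniff_file on one of the files written by the code in the question would be expected to return 'html' rather than 'jpeg', which matches the viewer's complaint.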

You should make another request for the image itself, so it can be fetched and stored:

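from urllib.parse import urljoin

for link in soup.find_all('img'):
    ll = link.get('src')
    # Build the absolute image URL; urljoin is meant for URLs, os.path.join is not
    img_rs = requests.get(urljoin('https://books.toscrape.com/', ll))  # <-- this line
    img_rs.raise_for_status()

    # The with block closes the file once the image has been written
    with open(os.path.join('images', os.path.basename(ll)), 'wb') as ima:
        for down in img_rs.iter_content(1000):  # <-- and iterate on the result
            ima.write(down)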
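
The src values on this page are relative, so they have to be resolved against the page URL before they can be requested. Here is a small sketch of just that step using urllib.parse.urljoin from the standard library, which follows URL rules (including "../" segments) on every platform. The function names image_url and local_name and the example paths in the comments are illustrative assumptions, not taken from the site.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def image_url(page_url: str, src: str) -> str:
    """Resolve an <img> src (relative or absolute) against the page it came from."""
    return urljoin(page_url, src)


def local_name(src: str) -> str:
    """File name to save under: last path segment, without any query string."""
    return posixpath.basename(urlparse(src).path)


# Hypothetical src values, for illustration only:
#   image_url('https://books.toscrape.com/', 'media/cache/ab/cover.jpg')
#   image_url('https://books.toscrape.com/catalogue/page-2.html', '../media/cover.jpg')
```

Resolving against the URL of the page that was actually fetched matters once you move past the front page, because the same relative src means something different on a page in a subdirectory.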

CodePudding user response：

Your problem is that after you find the URL of the image you never use it; instead you save the body of the initial request, which is just the HTML of the page. Try something like this instead:

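import os
import shutil
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'https://books.toscrape.com/'
res = requests.get(base_url)
res.raise_for_status()
# Save a copy of the page HTML; the with block flushes and closes the file
with open('op.html', 'wb') as file:
    for i in res.iter_content(10000):
        file.write(i)


os.makedirs('images', exist_ok=True)
with open('op.html', 'rb') as newfile:
    data = newfile.read()
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('img'):
    ll = link.get('src')

    ima = os.path.join('images', os.path.basename(ll))
    # Resolve the relative src against the site root
    current_img = urljoin(base_url, ll)
    # stream=True defers the download so img_res.raw can be copied straight to disk
    # (note: .raw does not undo gzip/deflate content encoding)
    img_res = requests.get(current_img, stream=True)
    img_res.raise_for_status()
    with open(ima, 'wb') as f:
        shutil.copyfileobj(img_res.raw, f)
    # Release the connection for this image
    img_res.close()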

CodePudding user response：

It is not clear why the HTML is saved to disk first. Leaving that part of the question's code aside, the task comes down to this:

import requests
from os.path import join, basename
from bs4 import BeautifulSoup as BS
from urllib.parse import urljoin

URL = 'https://books.toscrape.com'
TARGET_DIR = '/tmp'

with requests.Session() as session:
    (r := session.get(URL)).raise_for_status()
    for image in BS(r.text, 'lxml').find_all('img'):
        src = image['src']
        (r := session.get(urljoin(URL, src), stream=True)).raise_for_status()
        with open(join(TARGET_DIR, basename(src)), 'wb') as t:
            for chunk in r.iter_content(chunk_size=8192):
                t.write(chunk)

In terms of performance, this can be sped up significantly with multithreading, since the downloads are I/O-bound.
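
As a rough illustration of the multithreaded variant, here is a sketch built on concurrent.futures.ThreadPoolExecutor. To stay dependency-free it fetches with urllib.request from the standard library rather than requests; that is a substitution made for this sketch, and the fetch parameter lets you pass in a requests-based function instead. The names fetch_bytes, save_image and download_all, and the default of 8 workers, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from os.path import basename, join
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download one URL and return its body (stdlib stand-in for requests.get)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_image(url, target_dir, fetch=fetch_bytes):
    """Fetch a single image, write it into target_dir, and return the saved path."""
    path = join(target_dir, basename(urlparse(url).path))
    data = fetch(url)
    with open(path, 'wb') as out:
        out.write(data)
    return path


def download_all(base_url, srcs, target_dir, fetch=fetch_bytes, max_workers=8):
    """Download every src concurrently; paths are returned in input order."""
    urls = [urljoin(base_url, src) for src in srcs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map re-raises the first download error when results are consumed
        return list(pool.map(lambda u: save_image(u, target_dir, fetch), urls))
```

With the names from the code above, this could be called as download_all(URL, [img['src'] for img in soup.find_all('img')], TARGET_DIR), where soup is the parsed page. Each image is held in memory before being written, which is fine for small thumbnails but would need chunked writes for large files.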