Loop through webpages via BeautifulSoup and download all images

Time:06-28

I would like to go through the below web pages and save the respective images using python:

Examples (10,000 URLs in total):

https://cryptopunks.app/cryptopunks/cryptopunk0001.png
https://cryptopunks.app/cryptopunks/cryptopunk0002.png
https://cryptopunks.app/cryptopunks/cryptopunk0003.png
...
https://cryptopunks.app/cryptopunks/cryptopunk9999.png

My goal is to then use the images to train a GAN for a project and generate new images with it.

I tried adapting the code below (from Loop through webpages and download all images) to the example URLs above, but unfortunately I cannot make it work:

from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url:str):
  d = soup(requests.get(url).text, 'html.parser')
  yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
         for i in d.find_all('a') if re.findall(r'/image/\d+', i['href'])]

n = 3  # end value
os.system('mkdir MARCO_images')  # folder can be named anything, as long as the same name is used when saving below
for i in range(n):
    with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)

Could anyone please help me out?

Thanks a lot!

CodePudding user response:

Since the image URLs follow a simple numbered pattern, there is no need for BeautifulSoup at all; I have used only requests and os, and the images will be saved in the "New folder" directory (or any folder name you choose). However, downloading 9,999 images one by one is rather slow, so you can use threading to run the download function concurrently.

import requests
import os
import threading

os.mkdir("New folder")


def get_images(url, index):
    r = requests.get(url)

    with open(f"New folder\image_{index}.png", "wb") as img:
        img.write(r.content)
    img.close()


n = 10000
for i in range(1, n):
    # The site zero-pads the number to four digits (cryptopunk0001.png, ...),
    # so format the index with {i:04d} to match the URLs from the question.
    t1 = threading.Thread(target=get_images, args=(f"https://cryptopunks.app/cryptopunks/cryptopunk{i:04d}.png", i))
    t1.start()
    # As you know the URL pattern, providing the number is enough to fetch
    # each image; the loop runs from 1 to 9999 as you wanted.

On my computer it took about 7 seconds to download 9 images without threads, and only about 2 seconds with threads, so threading speeds up the downloads considerably.
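One caveat with the loop above is that it starts a new thread per image, so nearly 10,000 threads may be alive at once. A bounded thread pool gets the same concurrency benefit with a fixed number of workers. Below is a minimal stdlib-only sketch (urllib instead of requests, so there are no third-party dependencies); the folder name punk_images, the worker count of 16, and the helper names punk_url/download are my own choices, not from the original post:

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://cryptopunks.app/cryptopunks/cryptopunk{:04d}.png"
OUT_DIR = "punk_images"


def punk_url(i: int) -> str:
    # Zero-pad the index to four digits to match the site's URL scheme,
    # e.g. cryptopunk0001.png ... cryptopunk9999.png.
    return BASE.format(i)


def download(i: int) -> str:
    path = os.path.join(OUT_DIR, f"cryptopunk_{i:04d}.png")
    with urllib.request.urlopen(punk_url(i), timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path


if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    # 16 workers reuse a bounded pool instead of spawning ~10,000 threads;
    # map() also waits for all downloads to finish before the pool exits.
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(download, range(1, 10000)))
```

The with block on the executor replaces the missing join() calls in the plain-threading version: it blocks until every queued download has completed.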
