How to scrape all flag images from a website using Python?

Is there a way to get all the flags from https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags using python code?

I tried with pd.read_html and did not succeed. I tried scraping but it got so messy and I couldn't do it.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")

# Parse the page and collect the anchors that wrap the images
soup = BeautifulSoup(page.content, 'html.parser')
flags = soup.find_all('a', attrs={'class': "image"})

Would be nice if I can download them to a specific folder too! Thanks in advance!

CodePudding user response:

As an alternative to your approach and the well-described one by MattieTK, you could also use CSS selectors to select your elements more specifically:

soup.select('img[src*="/Flag_of"]')

Iterate over the ResultSet, take each src, and pass it to a function that downloads the image. The src values are protocol-relative (they start with //), so prepend 'https:':

for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))

Example

import requests
from bs4 import BeautifulSoup

def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        file_name = url.split('/')[-1]
        with open(file_name,'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print("Image couldn't be retrieved:", url)

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))

CodePudding user response:

In your example, flags is a ResultSet (a list-like collection) of anchor tags, each of which contains an img tag.

What you want is a way to get each individual src attribute from the image tag.

You can achieve this by looping over the results of your soup.find_all call, like so. Inside the loop each flag is a single anchor tag, so you can take its first child (the img tag) and read the value of its src attribute.

for flag in soup.find_all('a', attrs={'class': "image"}):
    # The first child of each anchor is the img tag
    src = flag.contents[0]['src']

You can then work on downloading each of these to a file inside the loop.
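
For the "specific folder" part of the question, here is a minimal sketch of what that download step could look like. The helper names (absolute_url, target_path, save_image) and the default folder name "flags" are made up for illustration, and the session argument is assumed to be something like a requests.Session(). It is passed in rather than created inside, so the helpers themselves only need the standard library.

```python
from pathlib import Path
from urllib.parse import unquote, urljoin, urlsplit

PAGE_URL = "https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags"


def absolute_url(src, base=PAGE_URL):
    """Resolve a src such as '//upload.wikimedia.org/...' against the page URL."""
    return urljoin(base, src)


def target_path(url, folder):
    """Build the local path: <folder>/<last URL path segment, percent-decoded>."""
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    return Path(folder) / name


def save_image(session, src, folder="flags"):
    """Download one image into folder; session is assumed to be e.g. requests.Session()."""
    url = absolute_url(src)
    path = target_path(url, folder)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    response = session.get(url, stream=True, timeout=30)
    response.raise_for_status()  # raise instead of saving an error page
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return path
```

Inside the loop you would then create one session = requests.Session() before the loop and call save_image(session, src, folder="flags") for each src. Reusing a single session avoids opening a new connection for every flag.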