Is there a way to get all the flags from https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags using Python code?
I tried pd.read_html without success. I also tried scraping, but it got messy and I couldn't get it to work.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
# Scrape the webpage
soup = BeautifulSoup(page.content, 'html.parser')
flags = soup.find_all('a', attrs={'class': "image"})
Would be nice if I can download them to a specific folder too! Thanks in advance!
CodePudding user response:
As an alternative to your approach and the one well described by MattieTK, you could also use CSS selectors to select your elements more specifically:
soup.select('img[src*="/Flag_of"]')
Iterate over the ResultSet, pick the src of each image, and use a function to download it:
for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
Example
import requests
from bs4 import BeautifulSoup

def download_file(url):
    # Stream the response so the whole image is not held in memory at once
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        # Use the last path segment of the URL as the local file name
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print("Image couldn't be retrieved", url)

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

# The src values lack a scheme, so prepend 'https:' before downloading
for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
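Since the question also asks for a specific folder, here is a minimal sketch of how download_file could be extended to write there instead of the current directory. The helper name local_path, the folder parameter, its default value "flags", and the timeout are my own assumptions for illustration, not part of the code above.

```python
from pathlib import Path
from urllib.parse import urlsplit

import requests


def local_path(url, folder):
    """Build the destination path for url inside folder (no network access)."""
    # Use only the URL's path component, so a query string never ends up in the name
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return Path(folder) / name


def download_file(url, folder="flags"):
    """Download url into folder; return the saved path, or None on failure."""
    target = local_path(url, folder)
    # Create the folder (and any missing parents) if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    r = requests.get(url, stream=True, timeout=30)
    if r.status_code != 200:
        print("Image couldn't be retrieved", url)
        return None
    with open(target, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return target
```

In the loop you would then call something like download_file('https:' + e.get('src'), folder='flags'), with whatever folder name suits you.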
CodePudding user response:
In your example, flags is a list (a ResultSet) of anchor tags, each containing an img tag.
What you want is a way to get the src attribute from each image tag.
You can achieve this by looping over the results of soup.find_all, like so. Each flag is a separate anchor, so you can take its first child (the image tag) and then read the value of its src attribute.
for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']
You can then work on downloading each of these to a file inside the loop.
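A rough sketch of what that loop could look like end to end is below. The names flag_urls and download_flags, the link_class parameter, the timeout, and the User-Agent string are all made up for illustration. The class name "image" is taken from the question and may not match the page's current markup, which is why it is a parameter. I have also used flag.find("img") rather than flag.contents[0], on the assumption that some anchors might not have the image as their first child.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags"
# Hypothetical identifier; Wikimedia asks clients to send a descriptive User-Agent
HEADERS = {"User-Agent": "flag-gallery-example/0.1 (replace with your contact details)"}


def flag_urls(html, base_url=PAGE_URL, link_class="image"):
    """Return absolute image URLs found inside <a class=link_class> tags."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for flag in soup.find_all("a", attrs={"class": link_class}):
        img = flag.find("img")  # more forgiving than flag.contents[0]
        if img is None or not img.get("src"):
            continue  # skip anchors that do not wrap an image
        # urljoin resolves scheme-less (//host/...) and relative src values
        urls.append(urljoin(base_url, img["src"]))
    return urls


def download_flags():
    """Fetch the gallery page and save every flag image it links to."""
    page = requests.get(PAGE_URL, headers=HEADERS, timeout=30)
    page.raise_for_status()
    for url in flag_urls(page.text):
        image = requests.get(url, headers=HEADERS, timeout=30)
        if image.status_code == 200:
            # Save under the last path segment of the URL, in the current directory
            with open(url.rsplit("/", 1)[-1], "wb") as f:
                f.write(image.content)
```

Calling download_flags() runs the whole thing. Keeping flag_urls separate from the network code means you can try it on a saved copy of the page first.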