I am trying to scrape all images URLs from any website.
from bs4 import BeautifulSoup
import urllib.request as urllib2
import re
html_page = urllib2.urlopen("http://imgur.com")
soup = BeautifulSoup(html_page, features="html5lib")
images = []
for img in soup.findAll('img', limit=None, recursive=True):
images.append(img.get('src'))
print(images)
It is tutorial code although it seems not to work. I have tried changing parser, setting limits to None, but it always returns only two results while there is plenty of img elems on this wesbite
['https://www.facebook.com/tr?id=742377892535530&ev=PageView&noscript=1', 'https://sb.scorecardresearch.com/p?c1=2&c2=22489583&cv=3.6.0&cj=1']
Could you please tell me how to get all of them?
CodePudding user response:
It is because this web needs javascript to render all content.
If you disable javascript you will see nothing.
You need to use some browser that use javascript for example you can use playwright.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.webkit.launch(headless=True)
baseurl = "http://imgur.com"
page = browser.new_page()
page.goto(baseurl)
allImages = page.query_selector_all("//img")
print("Total images: " str(len(allImages)))
for img in allImages:
print(img.get_attribute("src"))
browser.close()
OUTPUT:
Total images: 43
https://s.imgur.com/desktop-assets/desktop-assets/icon-new-post.13ab64f9f36ad8f25ae3544b350e2ae1.svg
//s.imgur.com/images/favicon-32x32.png
https://s.imgur.com/desktop-assets/desktop-assets/icon-search.8d0f9b564a4659d48d8eca38b968a7f2.svg
https://s.imgur.com/desktop-assets/desktop-assets/icon-filter.551faed00bcf04e07c9e01a6874bd652.svg
CodePudding user response:
The website might have changed since the tutorial was written. As has been mentioned, you would normally need javascript to run for the HTML to be rendered enabling the full HTML to be seen.
It is not always possible to write code that will work on all websites. In this case you are probably only seeing the social media images at the bottom of the page.
A faster alternative just for this website which does not need javascript is to use your browser to watch it make network requests (debug tools). This shows the browser making a simple URL call to get all the images returned as JSON. This can easily be recreated as follows:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
for page in range(1, 3):
req = requests.get(f"https://api.imgur.com/post/v1/posts?client_id=546c25a59c58ad7&filter[section]=eq:hot&include=adtiles,adconfig,cover,viral&location=desktophome&page={page}&sort=-time", headers=headers)
data = req.json()
print(f"Page {page}")
for entry in data:
print(' ', entry['url'])
Giving you output starting:
Page 1
https://imgur.com/gallery/mGHw3vB
https://imgur.com/gallery/K8BTDBm
https://imgur.com/gallery/nGedshr
https://imgur.com/gallery/6J268nt
https://i.imgur.com/P0dUnkW.jpeg
https://imgur.com/gallery/6O7U8Az
https://imgur.com/gallery/wDz9DkU
I suggest you print(data)
to see all the other useful information that is returned.