Unable to extract some links in a webpage using BeautifulSoup

Time:06-24

I am trying to scrape images from this webpage. It has many images, each linking to a new page, and I want to follow each link and extract the image from the child page. To do that, I first need a list of all the links on the original page. I have the following code:

# import necessary libraries
from bs4 import BeautifulSoup
import requests
import re


# function to extract html document from given url
def getHTMLdocument(url):
    # request for HTML document of given url
    response = requests.get(url)

    # response.text holds the page's HTML as a string
    return response.text


# assign the URL to scrape
url_to_scrape = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"

# create document
html_document = getHTMLdocument(url_to_scrape)

# create soup object
soup = BeautifulSoup(html_document, 'html.parser')

# find all the anchor tags with "href"
# attribute starting with "https://"
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    # display the actual urls
    print(link.get('href'))

However, this only gives me the following list:

https://www.globalcitizen.org/en/content/ways-to-help-ukraine-conflict/
https://www.wikiart.org/en/giovanni-bellini/leonardo-loredan-1501-1
https://www.1st-art-gallery.com/
https://www.wikiart.org/en/giovanni-bellini/leonardo-loredan-1501-1
https://www.facebook.com/wikiart.org
https://twitter.com/wikipaintings
https://www.1st-art-gallery.com/
https://www.1st-art-gallery.com/
https://wikiart.uservoice.com
https://itunes.apple.com/us/app/wikiart/id1235995167
https://play.google.com/store/apps/details?id=com.ilit.wikipaintings
https://www.facebook.com/wikiart.org
https://twitter.com/wikipaintings

These are all links that exist on the webpage, but the list is missing the links you would follow by clicking on one of the images. Clicking on an image takes you to another ordinary https:// page, so I am not sure what I am missing here.

I can see that the images are not 'normal links', because if I Ctrl-click one, the new page opens in the same tab instead of a new one. I am guessing that is related to why they do not show up in BeautifulSoup? But I do not know what those types of links are called, so I don't know what to search for.

CodePudding user response:

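The links to the images are embedded in the HTML source, but not as ordinary anchor tags: they are stored as JSON inside the ng-init attribute of the artworks-by-dictionary div, so you need to extract them first. Once you have the image source URLs, you can download them if you like.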

Here's how:

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured#!#filterName:featured,viewType:masonry"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

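# The painting data sits as JSON inside the "ng-init" attribute of this div.
ng_init = (
    BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    .find("div", class_="artworks-by-dictionary")["ng-init"]
)

# Pull the JSON array out of the attribute string and keep each image URL.
images = [
    i["image"] for i in json.loads(re.search(r":\s(\[.*\])", ng_init).group(1))
]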

print("\n".join(images))

Output:

https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg
https://uploads4.wikiart.org/images/felix-vallotton/the-pont-neuf-1901.jpg
https://uploads7.wikiart.org/images/pierre-roy/les-mauvaises-graines-1901.jpg
https://uploads2.wikiart.org/images/felix-vallotton/the-way-to-locquirec-1902.jpg
https://uploads8.wikiart.org/images/felix-vallotton/the-five-painters-1902.jpg
https://uploads1.wikiart.org/images/felix-vallotton/the-toilet-1905.jpg

and more...
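
To see what the regex and json.loads steps are doing without hitting the site, here is a minimal offline sketch. SAMPLE_HTML and extract_image_urls are made up for illustration, and the exact layout of the real ng-init value is an assumption; only the div class, the attribute name, and the "image" key come from the code above.

```python
import json
import re

from bs4 import BeautifulSoup

# Made-up stand-in for the real page. The real ng-init value is assumed to
# contain a JSON array somewhere after a colon, as the regex above implies.
SAMPLE_HTML = (
    "<div class=\"artworks-by-dictionary\" "
    "ng-init='paintings: [{\"image\": \"https://example.com/a.jpg\"}, "
    "{\"image\": \"https://example.com/b.jpg\"}]'></div>"
)


def extract_image_urls(html):
    """Return the "image" values from the JSON array stored in ng-init."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="artworks-by-dictionary")
    if container is None or not container.has_attr("ng-init"):
        return []
    # Same pattern as in the answer: a colon, one whitespace character,
    # then everything from the first "[" to the last "]".
    match = re.search(r":\s(\[.*\])", container["ng-init"])
    if match is None:
        return []
    return [item["image"] for item in json.loads(match.group(1))]


print(extract_image_urls(SAMPLE_HTML))
```

Returning an empty list when the div or attribute is missing is a design choice for the sketch; raising an error may be more useful if the page layout changes.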

EDIT:

Actually, there's an even easier way to get that data:

import json

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured"
}

url = "https://www.wikiart.org/en/paintings-by-style/magic-realism?select=featured&json=2&layout=new&page=1&resultType=masonry"
paintings = requests.get(url, headers=headers).json()["Paintings"]

for painting in paintings:
    print(f"{painting['artistName']} - {painting['title']}\n{painting['image']}")

Output:

Felix Vallotton - Portrait of Thadee Nathanson
https://uploads1.wikiart.org/images/felix-vallotton/portrait-of-thadee-nathanson-1897.jpg
Felix Vallotton - The Source
https://uploads1.wikiart.org/images/felix-vallotton/the-source-1897.jpg
Telemaco Signorini - The morning toilet
https://uploads3.wikiart.org/00236/images/telemaco-signorini/pag026.jpg
Felix Vallotton - Laid down woman, sleeping
https://uploads6.wikiart.org/images/felix-vallotton/laid-down-woman-sleeping-1899.jpg
Felix Vallotton - Sunset
https://uploads6.wikiart.org/images/felix-vallotton/sunset-1910.jpg
Felix Vallotton - Red Sand and Snow
https://uploads8.wikiart.org/images/felix-vallotton/red-sand-and-snow-1901.jpg
Felix Vallotton - The pier of Honfleur
https://uploads2.wikiart.org/images/felix-vallotton/the-pier-of-honfleur-1901.jpg

and more ...
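
Once you have the image URLs from either approach, downloading them could look roughly like this. It is a sketch, not tested against the site: filename_from_url, download_image, the "images" folder, and the timeout value are all hypothetical choices, and the site may require the same headers as above.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

import requests


def filename_from_url(image_url):
    """Use the last path segment of the URL as the local file name."""
    return PurePosixPath(urlparse(image_url).path).name


def download_image(image_url, target_dir="images", timeout=30):
    """Stream one image into target_dir and return the path it was saved to."""
    folder = Path(target_dir)
    folder.mkdir(parents=True, exist_ok=True)
    destination = folder / filename_from_url(image_url)
    # stream=True avoids loading the whole file into memory at once.
    with requests.get(image_url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(destination, "wb") as handle:
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
    return destination
```

You would then call download_image(painting["image"]) inside the loop over paintings.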

BONUS

By incrementing the page parameter in the URL, you can paginate through the results.
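
A rough sketch of that pagination, with the network call passed in as a function so the loop itself stays easy to test. page_url, iter_paintings, and max_pages are hypothetical names, and stopping when "Paintings" is empty or missing is an assumption about how the endpoint behaves past the last page.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.wikiart.org/en/paintings-by-style/magic-realism"


def page_url(page):
    """Build the JSON endpoint URL from the answer for a given page number."""
    params = {
        "select": "featured",
        "json": 2,
        "layout": "new",
        "page": page,
        "resultType": "masonry",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def iter_paintings(fetch_json, max_pages=5):
    """Yield paintings page by page until a page comes back empty.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON as a dict. max_pages is a safety cap, not a site limit.
    """
    for page in range(1, max_pages + 1):
        paintings = fetch_json(page_url(page)).get("Paintings")
        if not paintings:  # assumed end-of-results signal
            break
        yield from paintings
```

With the headers from the snippet above, you could call it as iter_paintings(lambda u: requests.get(u, headers=headers).json()).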
