get link after href-Tag with BeautifulSoup

Time:08-21

I want to download all the pictures from this page in high resolution, not the preview pictures:

https://www.booklooker.de/Bücher/Donna-W-Cross Die-Päpstin/id/A02A8f9001ZZl

The link (https://xxxxx.de) to the images I want to download is stored in this part of the HTML code: link to the picture

The code I have tried so far is:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.booklooker.de/Bücher/Donna-W-Cross Die-Päpstin/id/A02A8f9001ZZl")

souped = BeautifulSoup(page.content, "html.parser")
for pic in souped.find_all(class_="preview hasXXL"):
    print(pic['href'])

With that I get to the right part of the HTML, but I don't understand how to scrape the link from the href attribute. When I try to scrape it, I get this result:

/app/detail.php?id=A02A8f9001ZZl&picNo=1" id="preview_1

But I expect this:

https://images.booklooker.de/x/02Sh07/Donna-W-Cross Die-Päpstin.jpg

What did I do wrong?

Thanks a lot for your help!!

CodePudding user response:

Since you're trying to download images, you can search for the <img> tags and use their src attribute, which holds the image URL.

Your Modified Code:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.booklooker.de/Bücher/Donna-W-Cross Die-Päpstin/id/A02A8f9001ZZl")

souped = BeautifulSoup(page.content, "html.parser")
for pic in souped.find_all("img", class_="previewImage"):
    print(pic["src"])

Output:

https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg
https://images.booklooker.de/t/02Sh08/Donna-W-Cross Die-Päpstin.jpg
...
https://images.booklooker.de/t/02Sh0S/Donna-W-Cross Die-Päpstin.jpg
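Note that these URLs contain /t/, while the URL expected in the question contains /x/ with the same image ID. A minimal sketch of converting one into the other, assuming the high-resolution image lives at the same path with x in place of t (this is inferred only from the two URLs shown here and is not verified against the site; to_full_size is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def to_full_size(thumb_url: str) -> str:
    """Swap a leading /t/ path segment for /x/ (assumed thumbnail -> full size)."""
    parts = urlsplit(thumb_url)
    segments = parts.path.split("/")
    # segments[0] is "" because the path starts with "/"
    if len(segments) > 1 and segments[1] == "t":
        segments[1] = "x"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(to_full_size("https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg"))
# https://images.booklooker.de/x/02Sh07/Donna-W-Cross Die-Päpstin.jpg
```

URLs whose first path segment is not t are returned unchanged, so the helper can be applied to every src value from the loop above.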

CodePudding user response:

If you want the image URLs (e.g. https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg), select the elements with the "previewImage" class (not the "preview hasXXL" class) and read the "src" attribute of each img element.

from bs4 import BeautifulSoup
import requests

url = "https://www.booklooker.de/Bücher/Donna-W-Cross Die-Päpstin/id/A02A8f9001ZZl"
page = requests.get(url)

souped = BeautifulSoup(page.content, "html.parser")
for pic in souped.find_all(class_="previewImage"):
    # resolve any relative URLs to absolute URLs using the base URL
    src = requests.compat.urljoin(url, pic['src'])
    print(src)

Output:

https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg
...
https://images.booklooker.de/t/02Sh0S/Donna-W-Cross Die-Päpstin.jpg
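The urljoin step only changes relative URLs; a src that is already absolute passes through unchanged. A small sketch of both cases using urllib.parse.urljoin from the standard library (requests.compat.urljoin is the same function), with the relative path taken from the href value in the question without its trailing id fragment:

```python
from urllib.parse import urljoin

base = "https://www.booklooker.de/Bücher/Donna-W-Cross Die-Päpstin/id/A02A8f9001ZZl"

# An absolute URL is returned as-is.
absolute = urljoin(base, "https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg")

# A root-relative path is resolved against the scheme and host of the base URL.
relative = urljoin(base, "/app/detail.php?id=A02A8f9001ZZl&picNo=1")

print(absolute)  # https://images.booklooker.de/t/02Sh07/Donna-W-Cross Die-Päpstin.jpg
print(relative)  # https://www.booklooker.de/app/detail.php?id=A02A8f9001ZZl&picNo=1
```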