Scraping Image URLs with BeautifulSoup

Time:05-30

Trying to learn something today and doing a bit of scraping.

I am trying to list product names and corresponding image URLs into a spreadsheet.

I managed to store the names, but the images don't seem to work. Hopefully you can help!

Here is the code I use for extracting the text:

results[0].find('p', {'class': 'product-card__name'}).get_text()

Here is what I thought would extract the image:

results[0].find('img', {'class':'product-card__image'}).get_src()

This is obviously not working. It fails with the error "'NoneType' object is not callable".

Any pointers?

For reference, below is the source I am trying to scrape.

<li><a href="/p/63818/bumbu-the-original-rum-glass-pack" title=" Bumbu The Original Rum Glass Pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])"><div><img src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" alt="Bumbu The Original Rum Glass Pack" class="product-card__image" loading="lazy" width="3" height="4"></div><div><p class="product-card__name"> Bumbu The Original Rum<span>Glass Pack</span></p><p> 70cl / 40% </p></div><div><p> £39.95 </p><p> (£57.07 per litre) </p></div></a></li>

CodePudding user response:

To grab the image URL, call .get('src') instead of .get_src(). A BeautifulSoup tag has no get_src() method; HTML attributes are read with .get() or with tag['src'].

results[0].find('img', {'class':'product-card__image'}).get('src')
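As for why the original call fails with that particular message: when you access a name that a tag does not define, BeautifulSoup treats it as a search for a child tag with that name. So .get_src quietly becomes .find('get_src'), which returns None, and the trailing () then tries to call None. Here is a small sketch that reproduces this, assuming a one-tag document made up for the demonstration (the variable names are mine, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, just enough to reproduce the error.
markup = '<img class="product-card__image" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg">'
soup = BeautifulSoup(markup, "html.parser")
img = soup.find("img", {"class": "product-card__image"})

# An unknown attribute name on a tag is treated as a child-tag lookup,
# so img.get_src behaves like img.find("get_src") and yields None.
missing = img.get_src

# Calling that None is what produces the TypeError from the question.
try:
    img.get_src()
except TypeError as exc:
    error_message = str(exc)

# Two equivalent ways to read the attribute when it is present.
src_via_get = img.get("src")    # returns None if the attribute is absent
src_via_index = img["src"]      # raises KeyError if the attribute is absent

print(missing)
print(error_message)
print(src_via_get)
```

.get('src') is usually the safer choice when some images might lack the attribute, since it returns None instead of raising.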

Example:

html='''
<li >
 <a  href="/p/63818/bumbu-the-original-rum-glass-pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])" title=" Bumbu The Original Rum Glass Pack">
  <div >
   <img alt="Bumbu The Original Rum Glass Pack" class="product-card__image" height="4" loading="lazy" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" width="3"/>
  </div>
  <div >
   <p >
    Bumbu The Original Rum
    <span >
     Glass Pack
    </span>
   </p>
   <p >
    70cl / 40%
   </p>
  </div>
  <div >
   <p >
    £39.95
   </p>
   <p >
    (£57.07 per litre)
   </p>
  </div>
 </a>
</li>
'''

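from bs4 import BeautifulSoup

# Parse the sample markup with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")
# print(soup.prettify())
# Find the <img> tag by its class, then read its src attribute
print(soup.find('img', {'class': 'product-card__image'}).get('src'))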

Output:

https://img.thewhiskyexchange.com/480/rum_bum4.jpg
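
To get from a single URL to the spreadsheet mentioned in the question, the same lookups can be run for every product and written out as CSV. What follows is only a rough sketch under a few assumptions: each product sits in its own <li>, and the class names are the two used in the question. The sample markup (including a second card with no image), the column names, and the helpers extract_rows and rows_to_csv are all made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down sample markup; the second item is a made-up card with no image.
SAMPLE_HTML = """
<ul>
  <li><a href="/p/63818/bumbu-the-original-rum-glass-pack">
    <img class="product-card__image" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg">
    <p class="product-card__name"> Bumbu The Original Rum<span>Glass Pack</span></p>
  </a></li>
  <li><a href="/p/0/example">
    <p class="product-card__name"> Example Product </p>
  </a></li>
</ul>
"""


def extract_rows(html):
    """Return a list of (name, image_url) tuples, one per <li>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li"):
        name_tag = item.find("p", {"class": "product-card__name"})
        img_tag = item.find("img", {"class": "product-card__image"})
        # find() returns None when nothing matches, so guard before using it.
        name = name_tag.get_text(" ", strip=True) if name_tag else ""
        image_url = img_tag.get("src", "") if img_tag else ""
        rows.append((name, image_url))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "image_url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To produce a file that a spreadsheet program can open, the same csv.writer calls can target a file opened with open(path, "w", newline="") instead of the in-memory buffer.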