How to extract all URLs from href under <a> tags, but it keeps giving me an error


category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})

Output of category_tag:

<div  role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True Accounts</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318068031">Business &amp; Economics</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318073031">Children's &amp; Young Adult</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318104031">Comics &amp; Mangas</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318105031">Computing, Internet &amp; Digital Media</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318118031">Crafts, Home &amp; Lifestyle</a></div>,

Now the problem is that I am not able to extract the href from the <a> inside these tags. It keeps showing an error.

I have already tried:

category_url_tag = category_tag.find('a')['href']

But it keeps showing an error.

category_url = []
for tag in category_tag:
    category_url.append(tag.get('href'))
print(category_url)

This printed a list containing only None values.

CodePudding user response:

Try to select your elements more specifically and prefer the id and tag structure over the dynamic class names:

soup.select('#zg-left-col a')

or, to be more strict, match only links whose path starts with a specific pattern:

soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')

The list could then be created via a list comprehension:

['https://www.amazon.in' + a.get('href') for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')]

Example

This uses a dict comprehension to get only unique URLs and, on top of that, the category name as well:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.amazon.in/gp/bestsellers/books/').text, 'html.parser')


{'https://www.amazon.in' + a.get('href'): a.text for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')}

Output

{'https://www.amazon.in/gp/bestsellers/books/1318158031': 'Action & Adventure',
 'https://www.amazon.in/gp/bestsellers/books/1318052031': 'Arts, Film & Photography',
 'https://www.amazon.in/gp/bestsellers/books/1318064031': 'Biographies, Diaries & True Accounts',
 'https://www.amazon.in/gp/bestsellers/books/1318068031': 'Business & Economics',
 'https://www.amazon.in/gp/bestsellers/books/1318073031': "Children's & Young Adult",
 'https://www.amazon.in/gp/bestsellers/books/1318104031': 'Comics & Mangas',
 'https://www.amazon.in/gp/bestsellers/books/1318105031': 'Computing, Internet & Digital Media',
 'https://www.amazon.in/gp/bestsellers/books/1318118031': 'Crafts, Home & Lifestyle',
 'https://www.amazon.in/gp/bestsellers/books/1318161031': 'Crime, Thriller & Mystery',
 'https://www.amazon.in/gp/bestsellers/books/22960344031': 'Engineering',...}
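
Note that Amazon may answer the default requests User-Agent with a robot-check page, in which case the selectors above return an empty result. A minimal sketch that sends a browser-like User-Agent (the header value below is only an example) looks like this:

import requests
from bs4 import BeautifulSoup

# Example header only; any common browser User-Agent string can be used here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

resp = requests.get('https://www.amazon.in/gp/bestsellers/books/', headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')

categories = {'https://www.amazon.in' + a.get('href'): a.text
              for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')}
print(categories)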

CodePudding user response:

You are looping over the div elements and calling get('href') on the div itself. You should instead find the <a> inside each div.

Please check the following code. It should give you the expected result.

category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})
categories = [(cat.find('a').text, cat.find('a')['href']) for cat in category_tag[1:]]
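
For reference, here is a self-contained sketch of this approach. It reuses the class names from the question (they look auto-generated and may change), guards against tree items without a link instead of slicing off the first element, and uses urljoin to turn the relative hrefs into absolute URLs:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.amazon.in/gp/bestsellers/books/'
soup = BeautifulSoup(requests.get(base).text, 'html.parser')

# Class names copied from the question; they may change between page builds.
category_tag = soup.find_all('div', {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})

categories = []
for cat in category_tag:
    a = cat.find('a')
    if a is not None:  # skip tree items that contain no link
        # urljoin turns the relative href into an absolute URL
        categories.append((a.text, urljoin(base, a['href'])))

print(categories)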