How to extract all URLs from the href attributes under a div (my attempts keep giving an error)

category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})

Output of category_tag:

<div  role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True Accounts</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318068031">Business &amp; Economics</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318073031">Children's &amp; Young Adult</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318104031">Comics &amp; Mangas</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318105031">Computing, Internet &amp; Digital Media</a></div>,
 <div  role="treeitem"><a href="/gp/bestsellers/books/1318118031">Crafts, Home &amp; Lifestyle</a></div>,

Now the problem is that I am not able to extract the href from the '<a>' inside each div. It keeps raising an error.

I have already tried:

category_url_tag = category_tag.find('a')['href']

But it keeps showing an error.

category_url = []
for tag in category_tag:
    category_url.append(tag.get('href'))
print(category_url)

This printed a list containing None.

CodePudding user response：

Try to select your elements more specifically, and rely on ids and the tag structure rather than on dynamic class names:

soup.select('#zg-left-col a')

or, to be stricter, match only links whose path starts with a specific pattern:

soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')
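
To see the difference between the two selectors without hitting the live site, here is a small sketch run against a hand-written snippet. The SAMPLE_HTML below is an assumption modelled on the output in the question (the "Help" link is invented to show what the stricter selector filters out); the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page's left-hand navigation.
SAMPLE_HTML = """
<div id="zg-left-col">
  <div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
  <div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
  <div role="treeitem"><a href="/gp/help/customer">Help</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every <a> that is a descendant of the element with id="zg-left-col".
all_links = [a["href"] for a in soup.select("#zg-left-col a")]

# Only those <a> tags whose href starts with the given prefix.
book_links = [
    a["href"]
    for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')
]

print(all_links)
print(book_links)
```

The first list contains all three links, while the second drops the one whose href does not start with /gp/bestsellers/books.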

The list can then be built with a list comprehension:

['https://www.amazon.in' + a.get('href') for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')]
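
As an alternative to concatenating strings, the standard library's urllib.parse.urljoin can build the absolute URLs. This is a swapped-in technique, not what the answer itself uses; the advantage is that an href that is already absolute is left alone. BASE_URL and the hrefs list are illustrative names and values.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"

# Illustrative hrefs: one relative (as in the question), one already absolute.
hrefs = [
    "/gp/bestsellers/books/1318158031",
    "https://www.amazon.in/gp/bestsellers/books/1318052031",
]

# urljoin resolves relative paths against the base and keeps absolute URLs as-is.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

print(absolute)
```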

Example

This uses a dict comprehension so that only unique URLs are kept and, in addition, each URL is mapped to its category name:

import requests
from bs4 import BeautifulSoup

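# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(requests.get('https://www.amazon.in/gp/bestsellers/books/').text, 'html.parser')

# Keys are the absolute URLs (duplicates collapse), values are the category names.
{'https://www.amazon.in' + a.get('href'): a.text for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')}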

Output

{'https://www.amazon.in/gp/bestsellers/books/1318158031': 'Action & Adventure',
 'https://www.amazon.in/gp/bestsellers/books/1318052031': 'Arts, Film & Photography',
 'https://www.amazon.in/gp/bestsellers/books/1318064031': 'Biographies, Diaries & True Accounts',
 'https://www.amazon.in/gp/bestsellers/books/1318068031': 'Business & Economics',
 'https://www.amazon.in/gp/bestsellers/books/1318073031': "Children's & Young Adult",
 'https://www.amazon.in/gp/bestsellers/books/1318104031': 'Comics & Mangas',
 'https://www.amazon.in/gp/bestsellers/books/1318105031': 'Computing, Internet & Digital Media',
 'https://www.amazon.in/gp/bestsellers/books/1318118031': 'Crafts, Home & Lifestyle',
 'https://www.amazon.in/gp/bestsellers/books/1318161031': 'Crime, Thriller & Mystery',
 'https://www.amazon.in/gp/bestsellers/books/22960344031': 'Engineering',...}
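
Why a dict gives unique URLs can be checked offline with a small sketch. The SAMPLE_HTML here is an assumption: a hand-written snippet in which one link deliberately appears twice. Because dict keys are unique, the repeated URL collapses into a single entry, whereas a list would keep both.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"

# Hypothetical snippet; the first link is repeated on purpose.
SAMPLE_HTML = """
<div id="zg-left-col">
  <a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a>
  <a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a>
  <a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')

# A list keeps every match, including the duplicate.
as_list = [BASE_URL + a["href"] for a in links]

# A dict keyed by URL keeps one entry per distinct URL.
as_dict = {BASE_URL + a["href"]: a.get_text(strip=True) for a in links}

print(len(as_list), len(as_dict))
```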

CodePudding user response：

You are looping over the divs themselves, and a div has no href. You need to find the '<a>' inside each div.

Please check the following code. It should give you the expected result.

category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})
categories = [(cat.find('a').text, cat.find('a')['href']) for cat in category_tag[1:]]
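
To make the cause of both failures concrete, here is a minimal sketch on a hand-written snippet. It assumes the divs can be matched by their role="treeitem" attribute, as they appear in the question's output; the class names from the real page are not used here.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the output shown in the question.
SAMPLE_HTML = """
<div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
<div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", {"role": "treeitem"})

# Attempt 1: find_all() returns a list-like ResultSet, which has no .find().
try:
    divs.find("a")
    resultset_has_find = True
except AttributeError:
    resultset_has_find = False

# Attempt 2: the href lives on the <a>, not on the <div>, so .get() gives None.
div_hrefs = [div.get("href") for div in divs]

# Working version: look up the <a> inside each div, skipping divs without one.
categories = [
    (div.a.get_text(strip=True), div.a["href"])
    for div in divs
    if div.a is not None
]

print(resultset_has_find, div_hrefs)
print(categories)
```

The `if div.a is not None` guard is a defensive addition, so that a div without a link does not raise a TypeError.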