category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})
Output of category_tag:
<div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action & Adventure</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film & Photography</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries & True Accounts</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318068031">Business & Economics</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318073031">Children's & Young Adult</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318104031">Comics & Mangas</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318105031">Computing, Internet & Digital Media</a></div>,
<div role="treeitem"><a href="/gp/bestsellers/books/1318118031">Crafts, Home & Lifestyle</a></div>,
Now the problem is that I am not able to extract the href
from the <a> tags. It keeps showing an error.
I have already tried:
category_url_tag = category_tag.find('a')['href']
But it keeps showing an error.
category_url = []
for tag in category_tag:
    category_url.append(tag.get('href'))
print(category_url)
This printed a list containing only None values.
CodePudding user response:
Try to select your elements more specifically, and prefer the id and tag structure over dynamically generated class names:
soup.select('#zg-left-col a')
or, to be stricter, match only links whose path starts with a specific pattern:
soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')
The list of URLs can then be built with a list comprehension:
['https://www.amazon.in' + a.get('href') for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')]
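To try the selector without hitting the live site, here is a minimal sketch that runs against a small static snippet. The SAMPLE_HTML markup is a simplified stand-in I made up for illustration (only the #zg-left-col id and the href pattern come from the page), and urljoin from the standard library is swapped in for plain string concatenation:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"

# Simplified stand-in for the left-hand navigation (assumed markup, not the real page).
SAMPLE_HTML = """
<div id="zg-left-col">
  <div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
  <div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
  <div role="treeitem"><a href="/gp/help/other">Unrelated link</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [href^="..."] keeps only links whose href starts with the given prefix.
links = soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')

# urljoin resolves each relative path against the site root.
category_urls = [urljoin(BASE_URL, a["href"]) for a in links]
print(category_urls)
```

The unrelated link is dropped by the prefix selector, so only the two category URLs remain.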
Example
This uses a dict comprehension to keep only unique URLs and, in addition, capture the category name:
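import requests
from bs4 import BeautifulSoup
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get('https://www.amazon.in/gp/bestsellers/books/').text, 'html.parser')
# Dict keys are unique, so a URL that appears twice ends up as a single entry
{'https://www.amazon.in' + a.get('href'): a.text for a in soup.select('#zg-left-col a[href^="/gp/bestsellers/books"]')}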
Output
{'https://www.amazon.in/gp/bestsellers/books/1318158031': 'Action & Adventure',
'https://www.amazon.in/gp/bestsellers/books/1318052031': 'Arts, Film & Photography',
'https://www.amazon.in/gp/bestsellers/books/1318064031': 'Biographies, Diaries & True Accounts',
'https://www.amazon.in/gp/bestsellers/books/1318068031': 'Business & Economics',
'https://www.amazon.in/gp/bestsellers/books/1318073031': "Children's & Young Adult",
'https://www.amazon.in/gp/bestsellers/books/1318104031': 'Comics & Mangas',
'https://www.amazon.in/gp/bestsellers/books/1318105031': 'Computing, Internet & Digital Media',
'https://www.amazon.in/gp/bestsellers/books/1318118031': 'Crafts, Home & Lifestyle',
'https://www.amazon.in/gp/bestsellers/books/1318161031': 'Crime, Thriller & Mystery',
'https://www.amazon.in/gp/bestsellers/books/22960344031': 'Engineering',...}
CodePudding user response:
You are looping over the div elements themselves, and a div has no href attribute. You should find the <a> inside each div instead.
Please check the following code. It should give you the expected result.
category_tag = soup.find_all('div' , {'class': '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'})
categories = [(cat.find('a').text, cat.find('a')['href']) for cat in category_tag[1:]]
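The difference between the two attempts can be reproduced on a static snippet. This is only a sketch: the SAMPLE_HTML markup and the short "item" class are placeholders I introduced in place of the long generated class names, and the None check is an extra guard that the one-liner above does not have:

```python
from bs4 import BeautifulSoup

# Placeholder markup; "item" stands in for the long generated class names.
SAMPLE_HTML = """
<div class="item" role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
<div class="item" role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
category_tag = soup.find_all("div", {"class": "item"})

# find_all returns a list-like ResultSet of <div> tags. A <div> carries no
# href of its own, so asking each one for 'href' yields None.
div_hrefs = [tag.get("href") for tag in category_tag]
print(div_hrefs)

# Descend into each <div>, take its <a>, and read the href from that tag.
categories = []
for div in category_tag:
    link = div.find("a")
    if link is not None and link.has_attr("href"):
        categories.append((link.get_text(strip=True), link["href"]))
print(categories)
```

This is also why category_tag.find('a') fails: find is a method of a single tag, while category_tag is a collection of tags that has to be iterated first.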