I am trying to scrape this url:
Can anyone suggest how to scrape this data?
CodePudding user response:
Based on this part of the question:
So, I tried to rewrite the code using the below loop.
To get the value of a single bullet, select it with a CSS selector that combines the pseudo-class :-soup-contains() with the adjacent sibling combinator (+):
soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text
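Note that select_one() returns None when nothing matches, so calling .text directly raises an AttributeError for a bullet the page does not have. A minimal sketch of a guard around that, assuming markup shaped like the detail bullets; the helper name bullet_value and the trimmed-down HTML are made up for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical markup in the same shape as the detail bullets
html = '''
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span><span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
    <li><span><span class="a-text-bold">ASIN
:
</span> <span>B098BC48PZ</span></span></li>
  </ul>
</div>
'''


def bullet_value(soup, label):
    """Return the value next to a bullet label, or None if the label is absent."""
    selector = f'#detailBulletsWrapper_feature_div span:-soup-contains("{label}") + span'
    node = soup.select_one(selector)
    # select_one() gives None when the selector matches nothing
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html, 'html.parser')
manufacturer = bullet_value(soup, 'Manufacturer')
missing = bullet_value(soup, 'Packer')  # not present in this snippet
print(manufacturer)
print(missing)
```

Keep in mind that :-soup-contains() does a substring match, so a short label could also match a longer one that contains it.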
To get a dict of the bullets and their values, use a dict comprehension, which lets you pick or filter based on the available keys:
{e.text.split('\n')[0]: e.find_next_sibling('span').text for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold')}
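Once the bullets are in a dict, picking or filtering is plain dict handling. A small sketch, assuming a dict in the shape the comprehension produces; the list wanted and the key 'Best Sellers Rank' are hypothetical and only there to show a missing key:

```python
# Sample dict in the shape the comprehension produces (values from the example page)
details = {
    'Product Dimensions': '33 x 23 x 12 cm; 600 Grams',
    'Manufacturer': 'RELAXO FOOTWEARS LIMITED',
    'ASIN': 'B098BC48PZ',
    'Country of Origin': 'India',
}

# Hypothetical list of the fields we care about
wanted = ['ASIN', 'Manufacturer', 'Best Sellers Rank']

# .get() returns None for bullets the page does not have
picked = {key: details.get(key) for key in wanted}

# Or keep only the keys that are actually present
present = {key: details[key] for key in wanted if key in details}

print(picked)
print(present)
```

The first variant gives you a fixed set of columns, which is handy if you write many products to a CSV; the second keeps only what was found.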
Be aware that the keys of a dict have to be unique, so if there are duplicated bullets, this has to be adjusted in the way that best fits your needs.
A possible solution where the first value wins:
details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    key = e.text.split('\n')[0]
    # keep only the first value seen for each key
    if key not in details:
        details[key] = e.find_next_sibling('span').text
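If dropping the later values is too lossy, one alternative (not part of the approach above, just a sketch) is to collect every value per key in a list with collections.defaultdict. The markup below is a shortened, hypothetical version of the detail bullets with one duplicated label:

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened, hypothetical markup with a duplicated "Manufacturer" bullet
html = '''
<div id="detailBulletsWrapper_feature_div"><ul>
<li><span><span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
<li><span><span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED, Aggarwal City Square</span></span></li>
<li><span><span class="a-text-bold">ASIN
:
</span> <span>B098BC48PZ</span></span></li>
</ul></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Every key maps to a list, so duplicated bullets keep all their values
all_values = defaultdict(list)
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    key = e.text.split('\n')[0].strip()
    value_span = e.find_next_sibling('span')
    if value_span is not None:  # skip labels without a value span
        all_values[key].append(value_span.get_text(strip=True))

print(dict(all_values))
```

You can then decide per key whether to take the first entry, the longest one, or join them.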
Example
from bs4 import BeautifulSoup
html = '''
<div id="detailBulletsWrapper_feature_div" data-feature-name="detailBullets" data-template-name="detailBullets" data-cel-widget="detailBulletsWrapper_feature_div"> <hr aria-hidden="true" > <h2>Product details</h2>
<div id="detailBullets_feature_div">
<ul > <li><span > <span class="a-text-bold">Product Dimensions
:
</span> <span>33 x 23 x 12 cm; 600 Grams</span> </span></li> <li><span > <span class="a-text-bold">Date First Available
:
</span> <span>30 June 2021</span> </span></li> <li><span > <span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED</span> </span></li> <li><span > <span class="a-text-bold">ASIN
:
</span> <span>B098BC48PZ</span> </span></li> <li><span > <span class="a-text-bold">Item model number
:
</span> <span>SX0687G</span> </span></li> <li><span > <span class="a-text-bold">Country of Origin
:
</span> <span>India</span> </span></li> <li><span > <span class="a-text-bold">Department
:
</span> <span>Mens</span> </span></li> <li><span > <span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED, RELAXO FOOTWEARS LIMITED, Aggarwal City Square, Plot No 10, Mangalam Palace. District Center, Rohini Sector-3, Delhi - 110085</span> </span></li> <li><span > <span class="a-text-bold">Packer
:
</span> <span>VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507</span> </span></li> </ul> </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    key = e.text.split('\n')[0]
    # keep only the first value seen for each key
    if key not in details:
        details[key] = e.find_next_sibling('span').text
print(soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text)
print(details)
Outputs
RELAXO FOOTWEARS LIMITED
and
{'Product Dimensions': '33 x 23 x 12 cm; 600 Grams',
'Date First Available': '30 June 2021',
'Manufacturer': 'RELAXO FOOTWEARS LIMITED',
'ASIN': 'B098BC48PZ',
'Item model number': 'SX0687G',
'Country of Origin': 'India',
'Department': 'Mens',
'Packer': 'VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507'}