I'm scraping a website that lists multiple products on the same page, and I don't want the prices of some of them. So I want to check the product category first and then get the listed price.
The website's HTML looks like this:
<section >
<span something I don't want>...</span>
<section class="category">
<span>Clothes</span>
<div something I don't want>...</div>
<section class="search_result_price">
<section>
<span something I don't want>...</span>
<span class="price">149.99</span>
</section>
</section>
I already know how to get to the category part with my own code, but I'm completely stuck on the other part.
for products in soup.find_all(class_='category'):
    category = products.text
    if category == 'Clothes':
        price = (theoretical piece of code)
How can I get to the specific price tag within this parent <section> tag?
CodePudding user response:
Use a regex:
import re
# Capture the number inside the price span; html_text is the page source as a string
pat = r'price">(\d+\.\d+)</span>'
prices = re.findall(pat, html_text)
CodePudding user response:
Using a CSS selector works as expected:
html='''
<section >
<span something I don't want>...</span>
<section class="category">
<span>Clothes</span>
<div something I don't want>...</div>
<section class="search_result_price">
<section>
<span something I don't want>...</span>
<span class="price">149.99</span>
</section>
</section>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for p in soup.select('.category section.search_result_price span.price'):
    print(p.text)
Output:
149.99
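The selector above prints the price of every category on the page. If only one category is wanted, the filter can probably be moved into the selector itself. This is a minimal sketch, assuming a recent soupsieve (the :-soup-contains pseudo-class) and a page where each category section is properly closed; the second "Toys" section is made-up sample data.

```python
from bs4 import BeautifulSoup

# Sample markup: two sibling category sections (the "Toys" one is hypothetical)
html = '''
<section class="category">
  <span>Clothes</span>
  <section class="search_result_price">
    <section><span class="price">149.99</span></section>
  </section>
</section>
<section class="category">
  <span>Toys</span>
  <section class="search_result_price">
    <section><span class="price">19.99</span></section>
  </section>
</section>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only category sections whose text contains "Clothes",
# then descend to the price span inside them
selector = ('section.category:-soup-contains("Clothes") '
            'section.search_result_price span.price')
prices = [tag.get_text(strip=True) for tag in soup.select(selector)]
print(prices)
```

With the sample markup this should print only the Clothes price.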
CodePudding user response:
You are close to your goal, but be aware that products.text will give you the text of the whole section; use products.span.text to get the category text only.
To get the price, find the <span class="price"> and check that it exists before reading its text, so a missing price does not raise an error:
price = products.find(class_='price').text if products.find('span', class_='price') else None
Example
from bs4 import BeautifulSoup
html='''
<section >
<span something I don't want>...</span>
<section class="category">
<span>Clothes</span>
<div something I don't want>...</div>
<section class="search_result_price">
<section>
<span something I don't want>...</span>
<span class="price">149.99</span>
</section>
</section>'''
soup = BeautifulSoup(html, 'html.parser')
for products in soup.find_all('section', class_='category'):
    category = products.span.text
    if category == 'Clothes':
        price = products.find(class_='price').text if products.find('span', class_='price') else None
        print(price)
Output
149.99
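The conditional expression above looks the price span up twice and returns the price as a string. A small helper could do the lookup once and convert the value to a number. This is only a sketch: price_of is a hypothetical name, the markup is simplified sample data, and it assumes prices use a dot as the decimal separator.

```python
from bs4 import BeautifulSoup

def price_of(section):
    """Return the section's price as a float, or None if missing or not numeric."""
    tag = section.find('span', class_='price')  # single lookup
    if tag is None:
        return None
    try:
        return float(tag.get_text(strip=True))
    except ValueError:
        return None

# Simplified sample markup; the second section has no price on purpose
html = '''
<section class="category"><span>Clothes</span><span class="price">149.99</span></section>
<section class="category"><span>Shoes</span></section>
'''

soup = BeautifulSoup(html, 'html.parser')
prices = {
    section.span.get_text(strip=True): price_of(section)
    for section in soup.find_all('section', class_='category')
}
print(prices)
```

A section without a price span ends up with None instead of raising an AttributeError.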
As an alternative, here is a leaner approach that creates structured output that is easy to process and handles a list of permitted categories:
from bs4 import BeautifulSoup
html='''
<section >
<span something I don't want>...</span>
<section class="category">
<span>Clothes</span>
<div something I don't want>...</div>
<section class="search_result_price">
<section>
<span something I don't want>...</span>
<span class="price">149.99</span>
</section>
<span something I don't want>...</span>
<section class="category">
<span>Shoes</span>
<div something I don't want>...</div>
<section class="search_result_price">
<section>
<span something I don't want>...</span>
<span class="price">90.99</span>
</section>
</section>'''
soup = BeautifulSoup(html, 'html.parser')
data = []
c_list = ['Clothes','Shoes']
for products in soup.select(f"section.category:-soup-contains({','.join(c_list)})"):
    data.append({
        'category' : products.span.text,
        'price' : products.find(class_='price').text if products.find('span', class_='price') else None
    })
data
Output
[{'category': 'Clothes', 'price': '149.99'},
{'category': 'Shoes', 'price': '90.99'}]
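One thing to keep in mind: :-soup-contains matches on substrings of the section's text, so 'Shoes' would also select a category such as 'Shoes accessories'. If exact category names matter, comparing the label against a set may be safer. A minimal sketch, with made-up sample markup and a hypothetical allowed set:

```python
from bs4 import BeautifulSoup

# Sample markup: the second category merely contains the word "Shoes"
html = '''
<section class="category"><span>Shoes</span><span class="price">90.99</span></section>
<section class="category"><span>Shoes accessories</span><span class="price">5.49</span></section>
'''

soup = BeautifulSoup(html, 'html.parser')
allowed = {'Shoes'}  # exact category names to keep

# Substring match: both sections are selected
substring_hits = soup.select('section.category:-soup-contains("Shoes")')

# Exact match: compare the label text against the allowed set
data = []
for section in soup.select('section.category'):
    label = section.find('span')
    price = section.find('span', class_='price')
    if label is None or price is None:
        continue
    name = label.get_text(strip=True)
    if name in allowed:
        data.append({'category': name, 'price': price.get_text(strip=True)})

print(len(substring_hits), data)
```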