How to find a tag within the same parent that has the child I want?

I am scraping a website that lists multiple products on the same page, and I don't need the prices of some of them. So I want to check the product category first and only then get the listed price.

The website code looks like this:

<section >
   <span something I don't want>...</span>
   <section class="category">
      <span>Clothes</span>
   <div something I don't want>...</div>
   <section class="search_result_price">
      <section>
         <span something I don't want>...</span>
         <span class="price">149.99</span>
      </section>
</section>

I already know how to get to the category part with my own code, but I'm completely stuck on the other part.

for products in soup.find_all(class_='category'):
   category = (products.text)
   if category == 'Clothes':
      price = (theoretical piece of code)

How can I get to the specific price tag within this parent <section> tag?

CodePudding user response：

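Use a regular expression:

import re

# Capture the number inside every span whose class attribute ends in price">
pat = r'price">(\d+\.\d+)</span>'
prices = re.findall(pat, html_text)  # html_text: the page source as a string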
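
As a rough illustration of what a pattern-based approach returns, the sketch below runs a similar expression on made-up sample markup (the `sample` string and both prices are assumptions, not the real page). It collects every price it finds, regardless of category, so on its own it does not cover the filtering part of the question, and it depends on the attribute being written exactly as `class="price"`.

```python
import re

# Made-up sample markup (assumption): two products from different categories.
sample = (
    '<section class="category"><span>Clothes</span>'
    '<span class="price">149.99</span></section>'
    '<section class="category"><span>Shoes</span>'
    '<span class="price">90.99</span></section>'
)

# Capture the decimal number inside each <span class="price">...</span>.
pattern = re.compile(r'<span class="price">\s*(\d+(?:\.\d+)?)\s*</span>')

prices = pattern.findall(sample)
print(prices)  # every price on the page, with no link to its category
```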

CodePudding user response：

Using a CSS selector works as expected:

html='''
<section >
   <span something I don't want>...</span>
   <section class="category">
      <span>Clothes</span>
   <div something I don't want>...</div>
   <section class="search_result_price">
      <section>
         <span something I don't want>...</span>
         <span class="price">149.99</span>
      </section>
</section>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for p in soup.select('.category section.search_result_price span.price'):
    print(p.text)

Output:

149.99

CodePudding user response：

You are close to your goal, but be aware that products.text returns the text of the whole section. Use products.span.text instead to get only the category text.
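
To make the difference concrete, here is a minimal sketch on a simplified, well-formed version of the markup. It assumes the price section ends up nested inside the category section, as the selectors in this thread imply; the `html` string is illustrative, not the real page.

```python
from bs4 import BeautifulSoup

# Simplified sample markup (assumption): the price section is nested
# inside the category section.
html = """
<section class="category">
  <span>Clothes</span>
  <section class="search_result_price">
    <span class="price">149.99</span>
  </section>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", class_="category")

# Text of the section and all of its descendants, whitespace stripped.
whole_text = section.get_text(strip=True)

# Text of the first <span> inside the section only.
category = section.span.get_text(strip=True)

print(repr(whole_text))
print(repr(category))
```

Comparing `whole_text` with 'Clothes' would fail here because the price text is part of it, while `category` holds just the category name.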

To get the price, find the span with class price and check that it exists before reading its text, to avoid errors:

price = products.find(class_='price').text if products.find('span', class_='price') else None

Example

from bs4 import BeautifulSoup

html='''
<section >
   <span something I don't want>...</span>
   <section class="category">
      <span>Clothes</span>
   <div something I don't want>...</div>
   <section class="search_result_price">
      <section>
         <span something I don't want>...</span>
         <span class="price">149.99</span>
      </section>
</section>'''

soup = BeautifulSoup(html, 'html.parser')

for products in soup.find_all('section', class_='category'):
    category = products.span.text
    if category == 'Clothes':
        price = products.find(class_='price').text if products.find('span', class_='price') else None
        print(price)

Output

149.99
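
The conditional expression above searches for the price span twice. A small variation, sketched below, looks it up once and reuses the result. The helper name `price_of` and the sample markup (including the price-less "Books" category) are hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative sample markup (assumption): one category with a price, one without.
html = """
<section class="category">
  <span>Clothes</span>
  <section class="search_result_price">
    <span class="price">149.99</span>
  </section>
</section>
<section class="category">
  <span>Books</span>
</section>
"""


def price_of(section):
    """Return the price text inside `section`, or None if it has no price span."""
    tag = section.find("span", class_="price")  # single lookup
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")

# Map each category name to its price (or None when the span is missing).
prices = {
    section.span.get_text(strip=True): price_of(section)
    for section in soup.find_all("section", class_="category")
}
print(prices)
```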

As an alternative, here is a leaner approach that produces structured output that is easy to process and works with a list of permitted categories:

from bs4 import BeautifulSoup

html='''
<section >
   <span something I don't want>...</span>
   <section class="category">
      <span>Clothes</span>
   <div something I don't want>...</div>
   <section class="search_result_price">
      <section>
         <span something I don't want>...</span>
         <span class="price">149.99</span>
      </section>
   <span something I don't want>...</span>
   <section class="category">
      <span>Shoes</span>
   <div something I don't want>...</div>
   <section class="search_result_price">
      <section>
         <span something I don't want>...</span>
         <span class="price">90.99</span>
      </section>
</section>'''

soup = BeautifulSoup(html, 'html.parser')

data = []

c_list = ['Clothes','Shoes']

# Select only the category sections whose text contains one of the permitted names
for products in soup.select(f"section.category:-soup-contains({','.join(c_list)})"):
    data.append({
        'category' : products.span.text,
        'price' : products.find(class_='price').text if products.find('span', class_='price') else None
    })

data

Output

[{'category': 'Clothes', 'price': '149.99'},
 {'category': 'Shoes', 'price': '90.99'}]
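
One thing to keep in mind is that `:-soup-contains()` matches substrings of the element's text, so a category such as "Shoes and Bags" would also be selected by "Shoes". If exact category names matter, a possible variation is to compare the span text against a set. The sketch below assumes well-formed markup and made-up sample data (the "Shoes and Bags" entry and its price are invented for illustration).

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption), with one category that merely contains "Shoes".
html = """
<section class="category">
  <span>Clothes</span>
  <section class="search_result_price"><span class="price">149.99</span></section>
</section>
<section class="category">
  <span>Shoes and Bags</span>
  <section class="search_result_price"><span class="price">59.99</span></section>
</section>
<section class="category">
  <span>Shoes</span>
  <section class="search_result_price"><span class="price">90.99</span></section>
</section>
"""

allowed = {"Clothes", "Shoes"}  # permitted categories, compared exactly

soup = BeautifulSoup(html, "html.parser")
data = []

for section in soup.select("section.category"):
    category = section.span.get_text(strip=True)
    if category not in allowed:
        continue  # skips "Shoes and Bags", which a substring match would keep
    price_tag = section.select_one("span.price")
    data.append({
        "category": category,
        "price": price_tag.get_text(strip=True) if price_tag is not None else None,
    })

print(data)
```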