Scrape specific data with beautiful soup when there is no unique class or id?

Time:11-27

I'm trying to scrape this page with Beautiful Soup: https://www.nb.co.za/en/view-book/?id=9780798182539

How do I target specific elements if they don't have unique class or id? Is it possible to scrape a div based on the value/text in the sibling div?

For instance, in the code below, how can I get 9780798182539 given that the sibling div contains <p>ISBN:</p>?

<div class="row clearfix">
    <div class="col-md-3 noPadding">
        <p>ISBN:</p>
    </div>
    <div class="col-md-9 noPadding">
        9780798182539
    </div>
</div>

Here is the complete html:

<div class="col-lg-7 col-md-12 col-sm-12 author-details">
    <h2>Step by Step: Counting to 50 </h2>
    <h5>
        <a href="/en/authors?authorId=2163">Cuberdon</a>
    </h5>

    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>ISBN:</p>
        </div>
        <div class="col-md-9 noPadding">
            9780798182539
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Publisher:</p>
        </div>
        <div class="col-md-9 noPadding">
            Human &amp; Rousseau
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Date Released:</p>
        </div>
        <div class="col-md-9 noPadding">
            November 2021
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Price (incl. VAT):</p>
        </div>
        <div class="col-md-9 noPadding">
            R 120.00
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Format:</p>
        </div>
        <div class="col-md-9 noPadding">
                    <p>Hard cover, 32pp</p>
        </div>
    </div>
</div>

CodePudding user response:

You could do a find_all on the main divs with class row clearfix, keep only the divs whose text contains the string ISBN:, and then do a find within each of those for the div with class col-md-9 noPadding. As a list comprehension, it would look like this:

[i.find('div', class_='col-md-9 noPadding').get_text().strip() for i in soup.find_all('div', class_='row clearfix') if 'ISBN:' in i.get_text()][0]

Output:

9780798182539
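
The same row-by-row idea can be extended to collect every label/value pair at once. The following is only a sketch, assuming each row keeps the col-md-3 (label) and col-md-9 (value) layout shown in the question; the trimmed html sample and the details name are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (assumed layout).
html = """
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>ISBN:</p></div>
    <div class="col-md-9 noPadding">9780798182539</div>
</div>
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>Publisher:</p></div>
    <div class="col-md-9 noPadding">Human &amp; Rousseau</div>
</div>
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>Format:</p></div>
    <div class="col-md-9 noPadding"><p>Hard cover, 32pp</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each label (without the trailing colon) to the text of its value div.
details = {}
for row in soup.select("div.row.clearfix"):
    label = row.select_one("div.col-md-3")
    value = row.select_one("div.col-md-9")
    if label and value:
        key = label.get_text(strip=True).rstrip(":")
        details[key] = value.get_text(strip=True)

print(details["ISBN"])
```

With the dictionary in hand, any field can be looked up by its label, for example details["Publisher"], without relying on the order of the rows.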

CodePudding user response:

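Here is a working solution so far. Note that find returns only the first div with class col-md-9 noPadding, which in this markup is the ISBN value.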

from bs4 import BeautifulSoup

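html = '''
<div class="col-lg-7 col-md-12 col-sm-12 author-details">
    <h2>Step by Step: Counting to 50 </h2>
    <h5>
        <a href="/en/authors?authorId=2163">Cuberdon</a>
    </h5>

    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>ISBN:</p>
        </div>
        <div class="col-md-9 noPadding">
            9780798182539
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Publisher:</p>
        </div>
        <div class="col-md-9 noPadding">
            Human &amp; Rousseau
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Date Released:</p>
        </div>
        <div class="col-md-9 noPadding">
            November 2021
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Price (incl. VAT):</p>
        </div>
        <div class="col-md-9 noPadding">
            R 120.00
        </div>
    </div>
    <div class="row clearfix">
        <div class="col-md-3 noPadding">
            <p>Format:</p>
        </div>
        <div class="col-md-9 noPadding">
                    <p>Hard cover, 32pp</p>
        </div>
    </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching div, which is the ISBN value here
div_text = soup.find('div', class_="col-md-9 noPadding")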
print(div_text.get_text(strip=True))

Output:

9780798182539
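
To select the value by its label rather than by its position, one option is to find the p tag by its text and then step to the sibling div. This is a minimal sketch, assuming the label text matches exactly (for example "ISBN:") and that the value div directly follows the label div; value_for is a hypothetical helper name, and the html sample is trimmed from the question.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (assumed layout).
html = """
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>ISBN:</p></div>
    <div class="col-md-9 noPadding">9780798182539</div>
</div>
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>Price (incl. VAT):</p></div>
    <div class="col-md-9 noPadding">R 120.00</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Return the text of the div that follows the label's div, or None."""
    # Find the <p> whose entire text equals the label.
    p = soup.find("p", string=label)
    if p is None:
        return None
    # Climb to the enclosing label div, then move to the next sibling div.
    label_div = p.find_parent("div")
    value_div = label_div.find_next_sibling("div")
    return value_div.get_text(strip=True) if value_div else None


print(value_for(soup, "ISBN:"))
```

Because the lookup is keyed on the label, it keeps working if the rows are reordered, and it returns None instead of raising when a label is absent.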

CodePudding user response:

You can use :-soup-contains to target the p tag by its text. Wrap it in the :has pseudo-class, using the child combinator > to specify a direct parent-child relationship, so that you match the immediate parent div. Then add an adjacent sibling combinator, +, with a div type selector to move to the adjacent div:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.nb.co.za/nb/view-book?id=9780798182539')
soup = bs(r.content, 'lxml')
# The + combinator moves from the label div to the value div right after it
print(soup.select_one('div:has(> p:-soup-contains("ISBN:")) + div').text.strip())
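
The selector approach can also be tried offline against the markup from the question, which avoids depending on the live page. This is a sketch under a few assumptions: a reasonably recent soupsieve (the library behind select) that supports :-soup-contains, a label that contains no double quotes, and the built-in html.parser instead of lxml. The select_value helper is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (assumed layout).
html = """
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>ISBN:</p></div>
    <div class="col-md-9 noPadding">9780798182539</div>
</div>
<div class="row clearfix">
    <div class="col-md-3 noPadding"><p>Date Released:</p></div>
    <div class="col-md-9 noPadding">November 2021</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def select_value(soup, label):
    """Return the text of the div adjacent to the div whose child <p> contains label."""
    # div:has(> p:-soup-contains("...")) matches the label div;
    # "+ div" then selects the div immediately after it.
    selector = f'div:has(> p:-soup-contains("{label}")) + div'
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


isbn = select_value(soup, "ISBN:")
released = select_value(soup, "Date Released:")
print(isbn, released)
```

Checking for None before reading the text avoids an AttributeError when the label is missing or the page layout changes.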