I'm trying to scrape this page with BeautifulSoup.
https://www.nb.co.za/en/view-book/?id=9780798182539
How do I target specific elements if they don't have unique class or id?
Is it possible to scrape a div based on the value/text in its sibling div?
For instance, in the code below, how can I get 9780798182539 when the sibling div contains <p>ISBN:</p>?
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
Here is the complete html:
<div class="col-lg-7 col-md-12 col-sm-12 author-details">
<h2>Step by Step: Counting to 50 </h2>
<h5>
<a href="/en/authors?authorId=2163">Cuberdon</a>
</h5>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Publisher:</p>
</div>
<div class="col-md-9 noPadding">
Human & Rousseau
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Date Released:</p>
</div>
<div class="col-md-9 noPadding">
November 2021
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Format:</p>
</div>
<div class="col-md-9 noPadding">
<p>Hard cover, 32pp</p>
</div>
</div>
</div>
CodePudding user response:
You could do a find_all on the main divs with class row clearfix, then filter for the divs that contain the string ISBN, and do a find on that div for the div with class col-md-9 noPadding. It would look like this as a list comprehension:
[i.find('div', class_='col-md-9 noPadding').get_text().strip() for i in soup.find_all('div', class_='row clearfix') if 'ISBN:' in i.get_text()][0]
Output:
9780798182539
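If more than one field is needed, the same row-by-row idea can be extended to collect every label/value pair at once. The following is only a sketch, assuming each row keeps the col-md-3 / col-md-9 layout shown in the question; the helper name book_details is made up for illustration, and the HTML is a shortened copy of the snippet above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two rows only).
html = '''
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Date Released:</p>
</div>
<div class="col-md-9 noPadding">
November 2021
</div>
</div>
'''


def book_details(soup):
    """Hypothetical helper: map each row's label text to its value text."""
    details = {}
    for row in soup.find_all('div', class_='row clearfix'):
        label = row.find('div', class_='col-md-3 noPadding')
        value = row.find('div', class_='col-md-9 noPadding')
        # Skip rows that do not follow the label/value layout.
        if label is None or value is None:
            continue
        key = label.get_text(strip=True).rstrip(':')
        details[key] = value.get_text(strip=True)
    return details


soup = BeautifulSoup(html, 'html.parser')
details = book_details(soup)
print(details['ISBN'])
```

Looking a value up by key (details['ISBN'], details['Date Released']) avoids repeating the filter for every field.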
CodePudding user response:
Here is the working solution, so far.
from bs4 import BeautifulSoup
html = '''
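<div class="col-lg-7 col-md-12 col-sm-12 author-details">
<h2>Step by Step: Counting to 50 </h2>
<h5>
<a href="/en/authors?authorId=2163">Cuberdon</a>
</h5>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Publisher:</p>
</div>
<div class="col-md-9 noPadding">
Human & Rousseau
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Date Released:</p>
</div>
<div class="col-md-9 noPadding">
November 2021
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Format:</p>
</div>
<div class="col-md-9 noPadding">
<p>Hard cover, 32pp</p>
</div>
</div>
</div>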
'''
soup = BeautifulSoup(html, "html.parser")
# find() returns only the first div with this class, which here is the ISBN value
div_text = soup.find('div', class_="col-md-9 noPadding")
print(div_text.get_text(strip=True))
Output:
9780798182539
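Since find() returns only the first match, it gives the ISBN here simply because the ISBN row comes first. A way to tie the lookup to the label itself is to find the p tag by its text and step over to the next sibling div of its parent. This is a minimal sketch, assuming the label text matches exactly (including the colon); value_for is a hypothetical helper name, and the HTML is a shortened copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two rows only).
html = '''
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
'''


def value_for(soup, label):
    """Hypothetical helper: text of the div that follows the label's div."""
    # Match the <p> whose entire text equals the label.
    p = soup.find('p', string=label)
    if p is None:
        return None
    # p.parent is the label div; its next sibling div holds the value.
    sibling = p.parent.find_next_sibling('div')
    return sibling.get_text(strip=True) if sibling is not None else None


soup = BeautifulSoup(html, 'html.parser')
print(value_for(soup, 'ISBN:'))
print(value_for(soup, 'Price (incl. VAT):'))
```

Returning None for a missing label keeps the caller from hitting an AttributeError when a field is absent on some pages.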
CodePudding user response:
You can use :-soup-contains to target the p tag by its text. Wrap it in the :has pseudo-class selector, and specify the relationship as direct parent-child with the child combinator >, to get the immediate parent div. Then add an adjacent sibling combinator +, with a div type selector, to move to the adjacent div:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.nb.co.za/nb/view-book?id=9780798182539')
soup = bs(r.content, 'lxml')
print(soup.select_one('div:has(> p:-soup-contains("ISBN:")) + div').text.strip())
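To try the selector without hitting the site, it can be run against the snippet from the question. This sketch assumes a bs4 install whose bundled soupsieve is recent enough to support :-soup-contains, and uses html.parser so that lxml is not required; the HTML is a shortened copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two rows only).
html = '''
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# div:has(> p:-soup-contains("ISBN:")) is the label div;
# "+ div" then moves to the div immediately after it.
isbn = soup.select_one('div:has(> p:-soup-contains("ISBN:")) + div')
price = soup.select_one('div:has(> p:-soup-contains("Price")) + div')

isbn_text = isbn.get_text(strip=True)
price_text = price.get_text(strip=True)
print(isbn_text)
print(price_text)
```

Note that :-soup-contains does a substring match, so "Price" is enough to pick out the "Price (incl. VAT):" label.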