I decided to experiment with web scraping and got stuck on a tricky div block. I have spent hours searching for a way to get the output I expected, but I can't work out which approach to take.
The problem is the div with class "listing__details-pricing", which comes in three different forms. Form 3 returns what I expect; the other two forms return additional values that I did not expect.
Form 1:
<div class="listing__details-pricing">
€16,000
<div >Private</div>
</div>
Form 2:
<div class="listing__details-pricing">
€16,000
<div >
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512">
<path d="M235.4 172.2c0-11.4 9.3-19.9 20.5-19.9 11.4 0 20.7 8.5 20.7 19.9s-9.3 20-20.7 20c-11.2 0-20.5-8.6-20.5-20zm1.4 35.7H275V352h-38.2V207.9z"></path>
<path d="M256 76c48.1 0 93.3 18.7 127.3 52.7S436 207.9 436 256s-18.7 93.3-52.7 127.3S304.1 436 256 436c-48.1 0-93.3-18.7-127.3-52.7S76 304.1 76 256s18.7-93.3 52.7-127.3S207.9 76 256 76m0-28C141.1 48 48 141.1 48 256s93.1 208 208 208 208-93.1 208-208S370.9 48 256 48z"></path>
</svg>
€306
<div >PER MONTH</div>
</div>
</div>
Form 3:
<div class="listing__details-pricing">€16,250</div>
Code:
from bs4 import BeautifulSoup
html = """<html>
<body>
<div class="vehicle-search-form__results">
<div class="listing__details listing__details--desktop">
<div >Meath</div>
<div >
<h2>VOLKSWAGEN Golf</h2>
<p>1.6 TDI MATCH EDITION BLUEMOTION 110PS 5DR</p>
</div>
<div >
<div >
<p>2016</p>
</div>
<div >(161 REG)</div>
<div >140,012 km</div>
</div>
<div class="listing__details-pricing">
€16,000
<div >Private</div>
</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div class="listing__details listing__details--desktop">
<div >Longford</div>
<div >
<h2>VOLKSWAGEN Passat</h2>
<p>2.0 TDI SE BUSINESS</p>
</div>
<div >
<div >
<p>2015</p>
</div>
<div >(152 REG)</div>
<div >164,778 km</div>
</div>
<div class="listing__details-pricing">€16,250</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div class="listing__details listing__details--desktop">
<div >Monaghan</div>
<div >
<h2>VOLKSWAGEN Passat</h2>
<p>HIGHLINE BE 2.0 TDI MANUAL 6SPEED FWD 150HP 4DR</p>
</div>
<div >
<div >
<p>2016</p>
</div>
<div >(161 REG)</div>
<div >230,000 km</div>
</div>
<div class="listing__details-pricing">
€16,000
<div >
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512">
<path d="M235.4 172.2c0-11.4 9.3-19.9 20.5-19.9 11.4 0 20.7 8.5 20.7 19.9s-9.3 20-20.7 20c-11.2 0-20.5-8.6-20.5-20zm1.4 35.7H275V352h-38.2V207.9z"></path>
<path d="M256 76c48.1 0 93.3 18.7 127.3 52.7S436 207.9 436 256s-18.7 93.3-52.7 127.3S304.1 436 256 436c-48.1 0-93.3-18.7-127.3-52.7S76 304.1 76 256s18.7-93.3 52.7-127.3S207.9 76 256 76m0-28C141.1 48 48 141.1 48 256s93.1 208 208 208 208-93.1 208-208S370.9 48 256 48z"></path>
</svg>
€306
<div >PER MONTH</div>
</div>
</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div ></div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find(class_="vehicle-search-form__results")
job_elements = results.find_all(class_="listing__details listing__details--desktop")
for job_element in job_elements:
    price = job_element.find(class_="listing__details-pricing")
    print(price.text.strip())
Current output:
€16,000
Private
€16,250
€16,000€306PER MONTH
Expected output:
€16,000
€16,250
€16,000
CodePudding user response:
Change the last line to:
print(price.contents[0].strip())
This prints:
€16,000
€16,250
€16,000
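Or take the first text node (since Beautiful Soup 4.4.0 the preferred name for this argument is string; text is the older spelling):
print(price.find(text=True).strip())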
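To see why the original line printed the extra values, here is a minimal sketch on a reduced copy of Form 1. It assumes Beautiful Soup 4.4.0 or later (for the string argument); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Reduced copy of "Form 1"; the class name is the one used in the question's code.
snippet = """<div class="listing__details-pricing">
€16,000
<div>Private</div>
</div>"""

price_div = BeautifulSoup(snippet, "html.parser").div

# .text joins every descendant string, so the nested "Private" label is included.
all_text = price_div.text.strip()

# .contents[0] is only the first child node, here the text node holding the price.
first_child = price_div.contents[0].strip()

# find(string=True) returns the first string found below the tag, which is the same node.
first_string = price_div.find(string=True).strip()

print(repr(all_text))
print(first_child)
print(first_string)
```

Both alternatives rely on the price being the first node inside the pricing div, which holds for all three forms shown in the question.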
CodePudding user response:
Every price value comes immediately after the opening tag of the pricing div, as a text node.
You can select the pricing divs directly with class_="listing__details-pricing"
and then get the text node's value by calling find(text=True):
from bs4 import BeautifulSoup
html = """<html>
<body>
<div >
<div >
<div >Meath</div>
<div >
<h2>VOLKSWAGEN Golf</h2>
<p>1.6 TDI MATCH EDITION BLUEMOTION 110PS 5DR</p>
</div>
<div >
<div >
<p>2016</p>
</div>
<div >(161 REG)</div>
<div >140,012 km</div>
</div>
<div class="listing__details-pricing">
€16,000
<div >Private</div>
</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div >
<div >Longford</div>
<div >
<h2>VOLKSWAGEN Passat</h2>
<p>2.0 TDI SE BUSINESS</p>
</div>
<div >
<div >
<p>2015</p>
</div>
<div >(152 REG)</div>
<div >164,778 km</div>
</div>
<div class="listing__details-pricing">€16,250</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div >
<div >Monaghan</div>
<div >
<h2>VOLKSWAGEN Passat</h2>
<p>HIGHLINE BE 2.0 TDI MANUAL 6SPEED FWD 150HP 4DR</p>
</div>
<div >
<div >
<p>2016</p>
</div>
<div >(161 REG)</div>
<div >230,000 km</div>
</div>
<div class="listing__details-pricing">
€16,000
<div >
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512">
<path d="M235.4 172.2c0-11.4 9.3-19.9 20.5-19.9 11.4 0 20.7 8.5 20.7 19.9s-9.3 20-20.7 20c-11.2 0-20.5-8.6-20.5-20zm1.4 35.7H275V352h-38.2V207.9z"></path>
<path d="M256 76c48.1 0 93.3 18.7 127.3 52.7S436 207.9 436 256s-18.7 93.3-52.7 127.3S304.1 436 256 436c-48.1 0-93.3-18.7-127.3-52.7S76 304.1 76 256s18.7-93.3 52.7-127.3S207.9 76 256 76m0-28C141.1 48 48 141.1 48 256s93.1 208 208 208 208-93.1 208-208S370.9 48 256 48z"></path>
</svg>
€306
<div >PER MONTH</div>
</div>
</div>
<div >
<span style="background-color: black;"></span>
<p>Black</p>
</div>
</div>
<div ></div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
job_elements = soup.find_all(class_="listing__details-pricing")
for job_element in job_elements:
    price = job_element.find(text=True).strip()
    print(price)
Output:
€16,000
€16,250
€16,000
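If a listing might lack a pricing div, or the price might not be the very first node, a slightly more defensive variant could look like the sketch below. The helper name first_direct_text is hypothetical, not part of Beautiful Soup; it only looks at text nodes that are direct children of the tag and skips blank ones. It is shown on a reduced copy of Form 2.

```python
from bs4 import BeautifulSoup


def first_direct_text(tag):
    """Return the first non-blank text node that is a direct child of tag, or None.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    if tag is None:
        return None
    # recursive=False restricts the search to direct children,
    # so text inside nested divs (such as "€306") is never considered.
    for node in tag.find_all(string=True, recursive=False):
        text = node.strip()
        if text:
            return text
    return None


# Reduced copy of "Form 2"; the class name is the one used in the question's code.
sample = """<div class="listing__details-pricing">
€16,000
<div>
€306
<div>PER MONTH</div>
</div>
</div>"""

pricing = BeautifulSoup(sample, "html.parser").find(class_="listing__details-pricing")
print(first_direct_text(pricing))
```

Returning None instead of raising keeps the loop over listings from stopping on a listing that has no price.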