Hello everybody,
I am trying to extract a value stored in a <span> that has no class of its own. In the HTML below, two classes interest me: "bill_of_sale" and "mortgage". Their <span> values are "Kupça var" and "İpoteka var", respectively. I need to extract these values for each item. I have already handled iterating over the items; I just need to extract the values nested inside these classes.
<div >
<div data-swiper-wrap="" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
<div class="bill_of_sale"><a target="_blank" href="/items/2810476"></a><span>Kupça var</span></div>
<div class="mortgage"><a target="_blank" href="/items/2810476"></a><span>İpoteka var</span></div>
<div ><span ></span><span ></span></div>
<div >Agentlik</div>
The code below lets me extract a value when the text sits directly inside the element that carries the class, as in the following structure:
<div class="location">Həzi Aslanov m.</div>
import requests
from bs4 import BeautifulSoup

page = 1
locations = []  # list to store the location of each listing
while page != 1200:
    url = f"https://bina.az/baki/alqi-satqi/menziller?page={page}"
    page_main = requests.get(url)
    soup = BeautifulSoup(page_main.content, "html.parser")
    results = soup.find(id="js-items-search")
    job_elements = results.find_all("div", class_="card_params")
    for job_element in job_elements:
        location = job_element.find(class_="location")
        locations.append(location.text)
    page += 1  # move on to the next page
However, the code above does not work if I want to extract a span value which is deep inside a class (the problem I described in the beginning).
Thank you in advance
CodePudding user response:
You can access the deeper results of a class like this:
bill = item.find('div',class_='bill_of_sale').find('span').text.strip()
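Note that if a listing lacks one of these blocks, the first find() returns None and the chained .find('span') raises AttributeError. A minimal sketch of a helper that guards against this, assuming the class names from the question; span_text, the "card" wrapper class, and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the "card" wrapper class is an assumption.
SAMPLE_HTML = """
<div class="card">
  <div class="bill_of_sale"><a href="/items/2810476"></a><span>Kupça var</span></div>
  <div class="mortgage"><a href="/items/2810476"></a><span>İpoteka var</span></div>
</div>
<div class="card">
  <div class="mortgage"><span>İpoteka var</span></div>
</div>
"""


def span_text(item, class_name):
    """Return the text of the first <span> inside <div class=class_name>, or ''."""
    div = item.find("div", class_=class_name)
    if div is None:  # the listing has no such block
        return ""
    span = div.find("span")
    return span.get_text(strip=True) if span is not None else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {"bill": span_text(card, "bill_of_sale"), "mortgage": span_text(card, "mortgage")}
    for card in soup.find_all("div", class_="card")
]
print(rows)
```

The second card has no bill_of_sale block, so its "bill" value comes back as an empty string instead of raising an exception.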
Here is a working example that will get the details of all listings and output the results to csv:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = 1
locations = []  # list to store one dict per listing
while page != 20:
    print(f'Scraping page {page}')
    url = f"https://bina.az/baki/alqi-satqi/menziller?page={page}"
    page_main = requests.get(url)
    soup = BeautifulSoup(page_main.content, "html.parser")
    for results in soup.find_all('div', class_='items_list'):  # there are multiple listing containers
        for item in results.find_all('div', class_='vipped'):
            location = item.find(class_="location").text.strip()
            try:
                bill = item.find('div', class_='bill_of_sale').find('span').text.strip()
            except AttributeError:  # no bill_of_sale block in this listing
                bill = ''
            try:
                mort = item.find('div', class_='mortgage').find('span').text.strip()
            except AttributeError:  # no mortgage block in this listing
                mort = ''
            price = item.find('div', class_='price').text.strip()
            rooms, size, floor = ('', '', '')
            for detail in item.find('ul', class_='name').find_all('li'):
                if 'otaqlı' in detail.text:
                    rooms = detail.text.strip()
                elif 'm²' in detail.text:
                    size = detail.text.strip()
                elif 'mərtəbə' in detail.text:
                    floor = detail.text.strip()
            # use a separate name so the loop variable `item` is not overwritten
            row = {
                'location': location,
                'bill': bill,
                'mortgage': mort,
                'price': price,
                'rooms': rooms,
                'size': size,
                'floor': floor
            }
            locations.append(row)
    page += 1  # advance to the next page

df = pd.DataFrame(locations)
df.to_csv('locations.csv', index=False)
CodePudding user response:
Once you have the node for the specified <div> and class, you can use .find_next() to get that <span>:
from bs4 import BeautifulSoup

html = '''<div >
<div data-swiper-wrap="" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
<div class="bill_of_sale"><a target="_blank" href="/items/2810476"></a><span>Kupça var</span></div>
<div class="mortgage"><a target="_blank" href="/items/2810476"></a><span>İpoteka var</span></div>
<div ><span ></span><span ></span></div>
<div >Agentlik</div>'''

soup = BeautifulSoup(html, 'html.parser')
# find the <div> by class, then take the first <span> that follows it
div_bos = soup.find('div', {'class':'bill_of_sale'}).find_next('span').text
div_mortgage = soup.find('div', {'class':'mortgage'}).find_next('span').text
print(div_bos)
print(div_mortgage)
Output:
Kupça var
İpoteka var
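One caveat: .find_next() searches everything that comes after the element in the document, not only its descendants, so if the <div> contains no <span> it returns the next <span> further down the page. A small sketch of the difference, using made-up markup in which the bill_of_sale block is deliberately left without a <span> (an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the bill_of_sale div has no <span> of its own.
html = """
<div class="bill_of_sale"><a href="/items/2810476"></a></div>
<div class="mortgage"><a href="/items/2810476"></a><span>İpoteka var</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
bill_div = soup.find("div", class_="bill_of_sale")

inside_only = bill_div.find("span")          # searches descendants only
anywhere_after = bill_div.find_next("span")  # searches the rest of the document

print(inside_only)          # None
print(anywhere_after.text)  # İpoteka var
```

If a missing value should stay empty rather than pick up a neighbour's text, .find('span') on the <div> itself is the safer choice.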