Get information in sub-tags

Time:07-05

I'm trying to retrieve information from a site by web scraping. The information I need is in nested sub-tags, but I'm not able to get it.

<div>
 <div class="ergov3-txtannonce"><span>
 House
 3
 pièces,
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>
<div>
 <div class="ergov3-txtannonce"><span>
 Appartement
 3
 pièces,
64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
</div>
<div>
 <div class="ergov3-txtannonce"><span>
 House
 4
 pièces,
81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>
</div>

I'm trying to get the ad and the city. I tried:

# BeautifulSoup
from bs4 import BeautifulSoup
import requests

# to get: House 3 pièces, 74 m²
ad = [ad.get_text() for ad in soup.find_all("span", class_='ergov3-txtannonce')]

# to get cities
cities = [city.get_text() for city in soup.find_all("cite", class_='ergov3-txtannonce')]

My output:

[]
[]

Expected output:

["House 3 pièces, 74 m²", "Appartement 3 pièces, 64 m²", "House 4 pièces, 81 m²"]
["New York (11111)", "Los Angeles (22222)", "Chicago (33333)"]

CodePudding user response:

Assuming your soup contains the provided HTML, select the elements that hold your information and iterate over the ResultSet to scrape it. Avoid multiple lists; try to scrape all the information in one go and store it in a more structured way:

...
data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })
...
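
If two parallel lists are still preferred, as in the question, descendant selectors can reach the span and cite inside each block. This is only a minimal sketch: it assumes the class ergov3-txtannonce sits on the div that wraps both tags (which would also explain why the find_all calls on span and cite came back empty), and the markup below is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in markup; assumes the class is on the <div> wrapping <span> and <cite>.
html = '''
<div class="ergov3-txtannonce"><span>House 3 pièces, 74 m²</span><cite>New York (11111)</cite></div>
<div class="ergov3-txtannonce"><span>Appartement 3 pièces, 64 m²</span><cite>Los Angeles (22222)</cite></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# The class belongs to the parent <div>, so filtering <span> by that class matches nothing.
direct = soup.find_all('span', class_='ergov3-txtannonce')

# "parent descendant" selectors pick the tags nested inside each matching <div>.
ads = [s.get_text(strip=True) for s in soup.select('.ergov3-txtannonce span')]
cities = [c.get_text(strip=True) for c in soup.select('.ergov3-txtannonce cite')]

print(len(direct), ads, cities)
```

The single list of dicts is still the safer structure: if one block lacks a cite, two separate lists silently fall out of step with each other.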

Note: If the elements are not present in your soup, the website's content may be loaded dynamically by JavaScript. That would be better suited to a new question with exactly that focus.
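
One quick way to check for that case is to test whether the selector matches anything in the HTML that was actually downloaded. A rough sketch, where has_elements is a hypothetical helper and both page strings are invented for illustration:

```python
from bs4 import BeautifulSoup

def has_elements(html: str, selector: str) -> bool:
    """Return True if at least one element in the raw HTML matches the CSS selector."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select_one(selector) is not None

# Invented examples: one page ships the listing in its HTML, the other only a script hook.
static_page = '<div class="ergov3-txtannonce"><span>House</span></div>'
js_page = '<div id="app"></div><script src="bundle.js"></script>'

print(has_elements(static_page, '.ergov3-txtannonce'))  # True
print(has_elements(js_page, '.ergov3-txtannonce'))      # False
```

With a live page, the text of the requests response would be passed in as html. A False result there suggests the listing is rendered client-side and never appears in the raw response.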

Example
from bs4 import BeautifulSoup

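# Sample markup; the class on the inner <div> is assumed from the selector used below.
html = '''
<div>
 <div class="ergov3-txtannonce"><span>
 House 3 pièces, 74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>
<div>
 <div class="ergov3-txtannonce"><span>
 Appartement 3 pièces, 64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
</div>
<div>
 <div class="ergov3-txtannonce"><span>
 House 4 pièces, 81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>
</div>
'''
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')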

data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })

data
Output
[{'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
 {'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
 {'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'}]
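
In the question's markup the span text is spread over several lines, and get_text(strip=True) only trims the ends of each string, so the inner line breaks would survive. A possible way to collapse them, shown as a sketch with a hypothetical clean helper and the same assumed class placement as above:

```python
from bs4 import BeautifulSoup

# One block laid out as in the question, with the text split across lines.
html = '''
<div class="ergov3-txtannonce"><span>
 House
 3
 pièces,
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

def clean(tag):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(tag.get_text().split())

block = soup.select_one('.ergov3-txtannonce')
title = clean(block.span)
city = clean(block.cite)

print(title)  # House 3 pièces, 74 m²
print(city)   # New York (11111)
```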