I am scraping information about multiple houses, and not every house lists the same fields. To stay consistent, I want to extract each value based on its label. For example, I have the following:
property_info = section_content.find_all('div',{'class':'dc_blocks_2c'})
property_info
This outputs:
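<div class="dc_blocks_2c">
<div class="dc_label">Bedrooms:</div>
<div class="dc_value">9 Bedroom(s)</div>
</div>,
<div class="dc_blocks_2c">
<div class="dc_label">Baths:</div>
<div class="dc_value">10 Full & 4 Half Bath(s)</div>
</div>,
<div class="dc_blocks_2c">
<div class="dc_label">Garage(s):</div>
<div class="dc_value">4 / Attached</div>
</div>,
<div class="dc_blocks_2c">
<div class="dc_label">Stories:</div>
<div class="dc_value">2</div>
</div>,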
:
:
:
To clarify my issue: some houses don't have the <div class="dc_label">Stories:</div> block, while others do.
If I do the following:
property_info = section_content.find_all('div',{'class':'dc_value'})
then I do get all of the text values I want; however, the list size will not be the same for every house. This pseudocode shows what I am trying to do:
if dc_label.text LIKE 'Bedrooms'
then bedroom_num == bedroom_dc_value
if dc_label.text LIKE 'Garage(s)'
then garage_num == garage_dc_value
if dc_label.text LIKE 'Bath(s)' IS EMPTY:
then bath_num == ""
:
:
etc.
Any pointers/advice will be appreciated! Thank you!
CodePudding user response:
Assuming you want to store the information in a structured way, e.g. a dict, you could use a generic approach to collect all properties, provided that each property label used as a key occurs only once per house.
If you later create a DataFrame or CSV from the results, missing properties will automatically be treated as missing values (for example NaN in a pandas DataFrame).
First collect the label and value from your ResultSet:
props = dict(e.stripped_strings for e in soup.find_all('div',{'class':'dc_blocks_2c'}))
Because each label ends with a colon, strip it from the keys:
props = {k.strip(':'): v for (k, v) in props.items()}
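To map this back to the pseudocode in the question, the individual fields can then be read from the dict with a default for labels that are absent. This is only a minimal sketch: the sample dict and the variable names (bedroom_num, garage_num, stories_num) are assumptions for illustration, standing in for a house that has no Stories block.

```python
# Hypothetical result for a house whose page has no "Stories" block.
props = {
    "Bedrooms": "9 Bedroom(s)",
    "Baths": "10 Full & 4 Half Bath(s)",
    "Garage(s)": "4 / Attached",
}

# dict.get returns the default ("" here) when the label was not found,
# so every house yields the same set of variables.
bedroom_num = props.get("Bedrooms", "")
garage_num = props.get("Garage(s)", "")
stories_num = props.get("Stories", "")

print(repr(stories_num))
```

Using an empty string as the default mirrors the pseudocode; None would work equally well if you prefer to distinguish "missing" from "empty".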
Example
from bs4 import BeautifulSoup
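html='''
<div class="dc_blocks_2c">
<div class="dc_label">Bedrooms:</div>
<div class="dc_value">9 Bedroom(s)</div>
</div>
<div class="dc_blocks_2c">
<div class="dc_label">Baths:</div>
<div class="dc_value">10 Full & 4 Half Bath(s)</div>
</div>
<div class="dc_blocks_2c">
<div class="dc_label">Garage(s):</div>
<div class="dc_value">4 / Attached</div>
</div>
<div class="dc_blocks_2c">
<div class="dc_label">Stories:</div>
<div class="dc_value">2</div>
</div>
'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')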
props = dict(e.stripped_strings for e in soup.find_all('div',{'class':'dc_blocks_2c'}))
props = {k.strip(':'): v for (k, v) in props.items()}
props
Output
{'Bedrooms': '9 Bedroom(s)',
'Baths': '10 Full & 4 Half Bath(s)',
'Garage(s)': '4 / Attached',
'Stories': '2'}
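If you collect one such dict per house, writing them to a CSV might look like the sketch below. It uses only the standard library; the houses list is made-up sample data (the second house has no Stories entry), and restval is the csv.DictWriter parameter that fills in columns a row does not have.

```python
import csv
import io

# Hypothetical per-house dicts, as produced by the extraction above.
houses = [
    {"Bedrooms": "9 Bedroom(s)", "Baths": "10 Full & 4 Half Bath(s)", "Stories": "2"},
    {"Bedrooms": "3 Bedroom(s)", "Baths": "2 Full Bath(s)"},
]

# Union of all labels seen across houses, in first-seen order.
fieldnames = list(dict.fromkeys(key for house in houses for key in house))

# Write to an in-memory buffer; swap in open("houses.csv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(houses)

csv_text = buffer.getvalue()
print(csv_text)
```

The column list is built from all houses first, so a label that appears on only one page still gets its own column and stays empty for the others.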