How to extract class text based on the previous class text value?

Time:07-18

I am scraping information about multiple houses, and the information available differs from house to house. To be consistent, I want to extract each value based on its class label. For example, I have the following:

property_info = section_content.find_all('div',{'class':'dc_blocks_2c'})
property_info

This outputs:

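<div class="dc_blocks_2c">
 <div class="dc_label">Bedrooms:</div>
 <div class="dc_value">9 Bedroom(s)</div>
 </div>,
 <div class="dc_blocks_2c">
 <div class="dc_label">Baths:</div>
 <div class="dc_value">10 Full  &amp; 4 Half Bath(s)</div>
 </div>,
 <div class="dc_blocks_2c">
 <div class="dc_label">Garage(s):</div>
 <div class="dc_value">4 / Attached</div>
 </div>,
 <div class="dc_blocks_2c">
 <div class="dc_label">Stories:</div>
 <div class="dc_value">2</div>
 </div>,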
:
:
:

To clarify my issue: some houses don't have the <div class="dc_label">Stories:</div> block, while others do. If I run property_info = section_content.find_all('div',{'class':'dc_value'}), I do get all of the text values I want, but the list length is not the same for every house. This pseudocode shows what I am trying to do:

if dc_label.text LIKE 'Bedrooms'
    then bedroom_num = bedroom_dc_value
if dc_label.text LIKE 'Garage(s)'
    then garage_num = garage_dc_value
if dc_label.text LIKE 'Bath(s)' IS EMPTY:
    then bath_num = ""
:
:
etc.

Any pointers/advice will be appreciated! Thank you!

Answer:

Assuming you want to store the information in a structured way, e.g. a dict, you could use a generic approach that collects all properties, provided that each property label, which is used as the key, occurs only once per house.

If you later build a DataFrame or CSV from the results, properties missing for a house are handled automatically: pandas, for example, fills them with NaN when a DataFrame is built from a list of such dicts.
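For the CSV case, here is a minimal sketch using only the standard library. The houses list is made-up sample data standing in for the per-house dicts, and an in-memory buffer replaces a real file. csv.DictWriter fills any key a house lacks with restval, so every row ends up with the same columns:

```python
import csv
import io

# Hypothetical per-house dicts; the second house has no 'Stories' entry
# and the first has no 'Garage(s)' entry.
houses = [
    {'Bedrooms': '9 Bedroom(s)', 'Stories': '2'},
    {'Bedrooms': '3 Bedroom(s)', 'Garage(s)': '2 / Attached'},
]

# Union of all keys, in first-seen order, becomes the CSV header.
fieldnames = list(dict.fromkeys(key for house in houses for key in house))

buffer = io.StringIO()  # stands in for a real file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(houses)  # missing keys are written as restval ('')

csv_text = buffer.getvalue()
print(csv_text)
```

Collecting the header from the data itself means no column list has to be maintained by hand when a listing introduces a new label.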

First collect the label and value from your ResultSet:

props = dict(e.stripped_strings for e in soup.find_all('div',{'class':'dc_blocks_2c'}))

Because each label ends with a colon, strip it from the keys:

props = {k.strip(':'): v for (k, v) in props.items()}
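The dict(...) one-liner relies on every block yielding exactly two strings; a block with a label but no value would raise a ValueError. As a more defensive sketch, the label and value can be looked up by class instead. The class names dc_label and dc_value come from the question, while the trimmed markup below is a stand-in for a real page:

```python
from bs4 import BeautifulSoup

# Class names are taken from the question; the markup is a shortened
# stand-in in which the 'Stories' block has no value div.
html = '''
<div class="dc_blocks_2c">
 <div class="dc_label">Bedrooms:</div>
 <div class="dc_value">9 Bedroom(s)</div>
</div>
<div class="dc_blocks_2c">
 <div class="dc_label">Stories:</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

props = {}
for block in soup.find_all('div', class_='dc_blocks_2c'):
    label = block.find('div', class_='dc_label')
    value = block.find('div', class_='dc_value')
    if label is None:
        continue  # nothing usable as a key, skip the block
    key = label.get_text(strip=True).rstrip(':')
    # A block without a value div is stored as an empty string
    props[key] = value.get_text(strip=True) if value is not None else ''

print(props)
```

Using rstrip(':') here removes only the trailing colon, which is all the labels in the question need.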
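
Example:

from bs4 import BeautifulSoup

# Class attributes restored so that find_all() can match the blocks
html = '''
<div class="dc_blocks_2c">
 <div class="dc_label">Bedrooms:</div>
 <div class="dc_value">9 Bedroom(s)</div>
</div>
<div class="dc_blocks_2c">
 <div class="dc_label">Baths:</div>
 <div class="dc_value">10 Full  &amp; 4 Half Bath(s)</div>
</div>
<div class="dc_blocks_2c">
 <div class="dc_label">Garage(s):</div>
 <div class="dc_value">4 / Attached</div>
</div>
<div class="dc_blocks_2c">
 <div class="dc_label">Stories:</div>
 <div class="dc_value">2</div>
</div>
'''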

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

props = dict(e.stripped_strings for e in soup.find_all('div',{'class':'dc_blocks_2c'}))
props = {k.strip(':'): v for (k, v) in props.items()}
props

Output:

{'Bedrooms': '9 Bedroom(s)',
 'Baths': '10 Full  & 4 Half Bath(s)',
 'Garage(s)': '4 / Attached',
 'Stories': '2'}
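
To get back to the fixed set of fields from the question's pseudocode, each lookup can become a dict.get call with a default. A minimal sketch, where the FIELDS list and the sample props (with 'Stories' left out on purpose) are assumptions for illustration:

```python
# props as produced above for one house; 'Stories' is missing on purpose
props = {
    'Bedrooms': '9 Bedroom(s)',
    'Baths': '10 Full  & 4 Half Bath(s)',
    'Garage(s)': '4 / Attached',
}

# Fields wanted for every house (hypothetical list, adjust as needed)
FIELDS = ['Bedrooms', 'Baths', 'Garage(s)', 'Stories']

# dict.get returns the default ('') when the key is absent,
# so every house yields a row with the same keys
row = {field: props.get(field, '') for field in FIELDS}
print(row)
```

This keeps the row shape identical for every house, regardless of which labels the page actually contained.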