Extract information with the same tag

Time:11-14

For the Zillow data below, the number of bedrooms (bds), the number of bathrooms (ba), and the square footage (sqft) all sit in the same kind of tag, <li>. How can I extract these three values? My code below is not working. The result should be: 3, 2, 1813

Can you please help? Thanks, Hong

# <div class="list-card-info"><a class="list-card-link list-card-link-top-margin" href="https://www.zillow.com/homedetails/12021-Tralee-Rd-UNIT-102-Lutherville-MD-21093/60873148_zpid/" tabindex="0">
# <address >12021 Tralee Rd UNIT 102, Lutherville, MD 21093</address></a>
# <div ><p >LONG &amp; FOSTER REAL ESTATE, INC.</p></div><div >
# <div >$411,000</div><ul >
# <li >3<abbr > <!-- -->bds</abbr></li>
# <li >2<abbr > <!-- -->ba</abbr></li>
# <li >1,813<abbr > <!-- -->sqft</abbr>
# </li><li >- Apartment for sale</li></ul></div></div>

from bs4 import BeautifulSoup

tag='<div ><a  href="https://www.zillow.com/homedetails/12021-Tralee-Rd-UNIT-102-Lutherville-MD-21093/60873148_zpid/" tabindex="0"><address >12021 Tralee Rd UNIT 102, Lutherville, MD 21093</address></a><div ><p >LONG &amp; FOSTER REAL ESTATE, INC.</p></div><div ><div >$411,000</div><ul ><li >3<abbr > <!-- -->bds</abbr></li><li >2<abbr > <!-- -->ba</abbr></li><li >1,813<abbr > <!-- -->sqft</abbr></li><li >- Apartment for sale</li></ul></div></div>'
tag = BeautifulSoup(tag, 'html.parser')

address = tag.findAll('address', {'class': 'list-card-addr'})
price   = tag.findAll('div', {'class': 'list-card-price'})
beds    = tag.findAll('li', {'class': ""}) 

# keep text only, remove tag
address = address[0].text
price = price[0].text
beds = beds[0].text
print(beds)
print(address, '---', price, '---', beds)

CodePudding user response:

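tag.findAll (the older name for find_all) returns a ResultSet containing every matching <li> element. There are four in this snippet: beds, baths, square feet, and the "- Apartment for sale" line. You can access each one by its index, as shown below. Note that .text keeps the unit, so the values come out as "3 bds", "2 ba" and "1,813 sqft".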

from bs4 import BeautifulSoup

tag= '<div ><a  href="https://www.zillow.com/homedetails/12021-Tralee-Rd-UNIT-102-Lutherville-MD-21093/60873148_zpid/" tabindex="0"><address >12021 Tralee Rd UNIT 102, Lutherville, MD 21093</address></a><div ><p >LONG &amp; FOSTER REAL ESTATE, INC.</p></div><div ><div >$411,000</div><ul ><li >3<abbr > <!-- -->bds</abbr></li><li >2<abbr > <!-- -->ba</abbr></li><li >1,813<abbr > <!-- -->sqft</abbr></li><li >- Apartment for sale</li></ul></div></div>'

tag = BeautifulSoup(tag, 'html.parser')

tags = tag.findAll('li', {'class': ""})

# keep text only, remove tag
beds = tags[0].text
baths = tags[1].text
sqft = tags[2].text
print(beds, '---', baths, '---', sqft)
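
If you would rather look the values up by label than by position, something like the following might work. It is only a sketch: it assumes every stat <li> has an <abbr> child holding the unit, as in the snippet from the question, and the names html, soup and stats are made up for the example. It also uses a plain find_all('li') instead of the empty-class filter.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the listing markup from the question (assumed structure).
html = (
    '<ul>'
    '<li>3<abbr> <!-- -->bds</abbr></li>'
    '<li>2<abbr> <!-- -->ba</abbr></li>'
    '<li>1,813<abbr> <!-- -->sqft</abbr></li>'
    '<li>- Apartment for sale</li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

stats = {}
for li in soup.find_all('li'):
    # Items without an <abbr> (e.g. "- Apartment for sale") carry no number.
    if li.abbr is None:
        continue
    label = li.abbr.get_text(strip=True)  # unit label: "bds", "ba" or "sqft"
    number = next(li.stripped_strings)    # first non-empty text node, e.g. "1,813"
    stats[label] = int(number.replace(',', ''))

print(stats)
```

With the markup above, stats ends up mapping bds to 3, ba to 2 and sqft to 1813, which matches the result asked for in the question.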

CodePudding user response:

That should do it:

import re
from bs4 import BeautifulSoup

tag='<div ><a  href="https://www.zillow.com/homedetails/12021-Tralee-Rd-UNIT-102-Lutherville-MD-21093/60873148_zpid/" tabindex="0"><address >12021 Tralee Rd UNIT 102, Lutherville, MD 21093</address></a><div ><p >LONG &amp; FOSTER REAL ESTATE, INC.</p></div><div ><div >$411,000</div><ul ><li >3<abbr > <!-- -->bds</abbr></li><li >2<abbr > <!-- -->ba</abbr></li><li >1,813<abbr > <!-- -->sqft</abbr></li><li >- Apartment for sale</li></ul></div></div>'
tag = BeautifulSoup(tag, 'html.parser')

# the first three <li> elements hold beds, baths and square feet
list_items = tag.findAll('li', {'class': ""})

# keep only the number (digits and thousands separators), dropping the unit
regex = re.compile(r'[\d,]+')
beds = regex.findall(list_items[0].text)[0]
baths = regex.findall(list_items[1].text)[0]
sqft = regex.findall(list_items[2].text)[0]

print(beds, '---', baths, '---', sqft)
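
The regex above still returns strings, so the square footage comes back as "1,813" rather than 1813. If you need real integers, a small helper along these lines could do the conversion. This is a sketch, not part of the answer above: first_int is a made-up name, and it assumes commas are only used as thousands separators.

```python
import re


def first_int(text):
    """Return the first number found in text as an int, or None if there is none."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(',', ''))


samples = ['3 bds', '2 ba', '1,813 sqft', '- Apartment for sale']
values = [first_int(s) for s in samples]
print(values)  # [3, 2, 1813, None]
```

Returning None for items without a number makes it easy to skip the "- Apartment for sale" entry instead of crashing on it.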