I am trying to scrape the Subdivision Facts section of a house listing on har.com. The URL and soup are set up as follows, and the HTML of that section for one house is copied below:
import requests
from bs4 import BeautifulSoup
house_url = 'https://www.har.com/homedetail/2701-main-st-1910-houston-tx-77002/15331551'
house_response = requests.get(url=house_url, headers=your_header)
house_soup = BeautifulSoup(house_response.text, 'html.parser').find('div', {'class':'pt-2 pb-2 mr-4 pr-md-5 ml-4 pl-md-5'})
Subdivision Section HTML
<div id="subDivisonInfo" data-contentname="subdivision-facts"><div >
<h2 tabindex="0">Subdivision Facts</h2>
<a href="/geomarketarea/100_midtown---houston">View Neighborhood Profile </a>
<div >
<a href="/geomarketarea/100_midtown---houston">
<div style="height: 360px; width: 100%; background-size: cover; background-repeat: no-repeat; background-position: center center; background-image: url("https://api.mapbox.com/styles/v1/mapbox/streets-v11/static/path-1 0000ff-0.45 0000ff-0.45(u|stDfobeQnCbExAhAdCx@xBXvDDlA@lMF~CBvBEhJOVAnDo@dDiClByCpCkEh`@sv@zx@vi@rGhEx@jAXF^lAB`@Bx@Dx@?fBPlBLxA`@fL^tMNhGFlBJdEDxAHlANdCHbBPpDDpA?hDFzE@xB@zADbBa@sB[k@q@gA}A}Aw@k@yAcAqBwAyAgAiBuAyCyBoBuAmA{@u@i@{BaB{C{AsBc@kF]{E?{DHqCIuESk@?CvCAv@?tCJ~K?lA}LB?{@yJB_K@_E?wJ@mX@yA@sIFgM@?kDOcCm@_IeBgG??aAyDs@_EWgBGsASuDQiNGoNKwJ??QuH~ClJjCfG)/auto/651x360?access_token=pk.eyJ1IjoiaGFyZGV2ZXJpY2siLCJhIjoiY2sxZ3FuNWJpMDFtbDNjbDJ0bnJnbnpkdyJ9.byj8yrbalnyCw4u9TNwYuA");">
<img src="https://content.harstatic.com/img/common/loading1.gif" style="display: none;">
</div>
<script type="text/javascript">
/*! domready (c) Dustin Diaz 2014 - License MIT */
;!function(e,t){"undefined"!=typeof module?module.exports=t():"function"==typeof define&&"object"==typeof define.amd?define(t):this.domready=t()}(0,function(){var e,t=[],o="object"==typeof document&&document,n=o&&o.documentElement.doScroll,d=o&&(n?/^loaded|^c/:/^loaded|^i|^c/).test(o.readyState);return!d&&o&&o.addEventListener("DOMContentLoaded",e=function(){for(o.removeEventListener("DOMContentLoaded",e),d=1;e=t.shift();)e()}),function(e){d?setTimeout(e,0):t.push(e)}});
</script>
<script type="text/javascript">
domready(function() {
HARMap.load().then(function(module) {
var componentId = 'image24906579';
var polygon = 'POLYGON((-95.372842651 29.762188072,-95.373816894 29.761474216,-95.374191992 29.76101599,-95.374483311 29.760352652,-95.374609142 29.759738202,-95.374640089 29.758819359,-95.374647572 29.758426112,-95.374691499 29.756117694,-95.374706659 29.75532102,-95.3746799 29.754718062,-95.374599516 29.752906601,-95.374594347 29.752790096,-95.374351409 29.751909331,-95.373661419 29.751078253,-95.372887724 29.750528187,-95.371869687 29.74980439,-95.362970840876 29.744465093909,-95.369806756 29.735213416,-95.370819903 29.733833779,-95.371197558 29.733537028,-95.371239769 29.733411918,-95.371629671 29.733245349,-95.371804383 29.73323255,-95.372090663 29.733211576,-95.372379167 29.733175792,-95.372896911 29.733184661,-95.373448298 29.733085864,-95.373897555 29.73302357,-95.376020952 29.732848991,-95.378367141 29.732692501,-95.379698574 29.732605591,-95.380251649 29.73256989,-95.381236514 29.732506316,-95.381686127 29.732477294,-95.382077218 29.732432106,-95.382753475 29.732353969,-95.383254109 29.732299798,-95.384141891 29.732214015,-95.384547373 29.732184995,-95.385401694 29.732177061,-95.386504599 29.7321448,-95.387113815 29.732128476,-95.38757473 29.732116124,-95.388065951 29.732085741,-95.387487144 29.732263493,-95.387266095 29.732397447,-95.386907035 29.732649085,-95.386438833 29.733120089,-95.386221661 29.733399038,-95.385875385 29.733846224,-95.385438799 29.734415138,-95.38508092 29.734869849,-95.384651063 29.735397347,-95.384037746 29.736172341,-95.383612227 29.736729284,-95.383311768 29.737122539,-95.383099783 29.737389038,-95.382608073 29.738007194,-95.382145211 29.738793482,-95.381972784 29.739372769,-95.381824272 29.740550975,-95.381818116 29.741650627,-95.381867014 29.742586972,-95.381820616 29.743316543,-95.38172399 29.744387766,-95.381721351 29.744610716,-95.382483057 29.744631286,-95.382763059 29.744636527,-95.383509646 29.74463557,-95.385588363 29.744584135,-95.385979746 29.744575193,-95.386003821 29.746807942,-95.385699411 
29.746810253,-95.385718338 29.748696241,-95.385725089 29.750624573,-95.385732125 29.75158054,-95.385735183 29.753459289,-95.385748282 29.757531403,-95.385758824 29.757976236,-95.385799269 29.75968271,-95.385808634 29.76196431,-95.384952318 29.761964306,-95.384292125 29.762037035,-95.382689529 29.76226967,-95.381370964 29.762781324,-95.381373603 29.762782609,-95.380443808 29.763114063,-95.379476498 29.763374528,-95.378959294 29.763485938,-95.378540344 29.763528353,-95.377633441 29.763629227,-95.375180147 29.763724096,-95.37270141 29.763764494,-95.370824522 29.763815179,-95.370818039 29.76381558,-95.369274785 29.76391082,-95.371101581 29.763114639,-95.372416403 29.762414919,-95.372842651 29.762188072))';
var node = $('.' + componentId).removeClass(componentId);
// var result = module.StaticMap.custom.withPolygon(node.width(), node.height(), polygon)
// result.backgroundImage(node);
var result = module.StaticMap.custom.withPolygon(node.width(), node.height(), polygon)
result.backgroundImage(node);
/*var geometry = module.geometry;
var points = geometry.pointsFromWKT(polygon);
//console.log(points);
if(points.length > 100) { points = geometry.simplifyPolygon(points, 0.0001); }
if(points.length > 100) { points = geometry.simplifyPolygon(points, 0.001); }
//console.log(points);
var encString = geometry.encodePath(points);
var width = node.width();
var height = node.height();
if(!width) { console.error('width cannot be empty!'); }
if(!height) { console.error('height cannot be empty!'); }
var path = encodeURIComponent("weight:1|fillcolor:blue|enc:" + encString);
var url = "/api/staticmap?size=" + width + "x" + height + "&path=" + path + "&client=gme-houstonrealtorsinformation";
// alert(url);
//$(node).html('<a href="' + url + '" id="hoodMapStaticLink"></a><img />');
var image = new Image();
image.onload = image.onerror = function() { node.find('img').remove(); }
image.src = url;
$(node).css('background-image', 'url(' + url + ')');*/
});
});
</script> </a>
</div>
<h3 tabindex="0">Facts (Based on Active listings)</h3>
<div >
<div >
<div >Market Area Name</div>
<div >Midtown - Houston</div>
</div>
<div >
<div >Home For Sales</div>
<div >104</div>
</div>
<div >
<div >Average List Price</div>
<div >$428,844</div>
</div>
<div >
<div >Average Bedrooms</div>
<div >2.27</div>
</div>
<div >
<div >Average Baths</div>
<div >2.07</div>
</div>
<div >
<div >Average Sqft</div>
<div >1,873</div>
</div>
<div >
<div >Average Price/Sqft</div>
<div >$236.48</div>
</div>
<div >
<div >Home For Lease</div>
<div >96</div>
</div>
<div >
<div >Average Lease</div>
<div >$2,396</div>
</div>
<div >
<div >Average Lease/Sqft</div>
<div >$1.76</div>
</div>
</div>
</div>
</div>
However, when I use BeautifulSoup to extract text such as "Average List Price: $428,844", this is the output I get:
house_soup.find('div',{'id':'subDivisonInfo'}).find('div',{'class':'row'}).findAll('div',{'class':'col-md-4 col-6 mb-4'})[0].getText()
'\n-----------\n-----------\n'
Why does it return this placeholder string instead of the actual text?
CodePudding user response:
The data you want is loaded from an external source via AJAX, so it is not in the page's initial HTML. Request the API URL directly instead:
import requests
from bs4 import BeautifulSoup
api_url = 'https://www.har.com/api/getSubdivisionFacts/15331551'
req = requests.get(api_url).text
# print(req)
soup = BeautifulSoup(req, 'lxml')
# The class value below is the one the question's selector uses for each fact cell.
price = soup.select_one('[class="col-md-4 col-6 mb-4"] > div:-soup-contains("Average List Price")').find_next_sibling('div')
print(price.text)
Output:
$428,844
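If you want every fact rather than a single value, the label/value layout shown in the question lends itself to a small loop. The following is only a sketch: `parse_facts` is a hypothetical helper, and `sample` is a condensed stand-in for the markup in the question, not the actual API response, which may be structured differently.

```python
from bs4 import BeautifulSoup

def parse_facts(html):
    """Return {label: value} for every cell made of exactly two leaf <div>s."""
    soup = BeautifulSoup(html, "html.parser")
    facts = {}
    for cell in soup.find_all("div"):
        children = cell.find_all("div", recursive=False)
        # A fact cell holds a label <div> and a value <div>, neither containing further divs.
        if len(children) == 2 and not any(c.find("div") for c in children):
            label, value = (c.get_text(strip=True) for c in children)
            facts[label] = value
    return facts

# Hypothetical, condensed sample modelled on the HTML in the question.
sample = """
<div>
  <div><div>Average List Price</div><div>$428,844</div></div>
  <div><div>Average Bedrooms</div><div>2.27</div></div>
</div>
"""
facts = parse_facts(sample)
print(facts["Average List Price"])
```

Matching on structure (two leaf divs per cell) rather than on class names is a deliberate choice here, since it only relies on what the pasted HTML shows; if the real response carries stable class names, selecting on those would be more robust.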
CodePudding user response:
This happens because the data is fetched by a script that runs when you open the URL in a browser. Try performing a GET request in Python and inspecting the HTML it returns: the initial HTML does not contain the details you are looking for, such as "Average List Price".
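A quick way to run that check is to test whether the label appears anywhere in the text of the fetched page. This is a minimal sketch: `has_text` is a hypothetical helper, and the two HTML strings are made-up stand-ins for the initial response and the browser-rendered page, not real captures from the site.

```python
from bs4 import BeautifulSoup

def has_text(html, needle):
    """True if `needle` appears in the text extracted from the parsed HTML."""
    return needle in BeautifulSoup(html, "html.parser").get_text()

# Hypothetical stand-ins: what requests might receive vs. what the browser shows
# after the page's script has filled in the section.
initial_html = '<div id="subDivisonInfo"><div>-----------</div></div>'
rendered_html = (
    '<div id="subDivisonInfo">'
    '<div>Average List Price</div><div>$428,844</div>'
    '</div>'
)

print(has_text(initial_html, "Average List Price"))
print(has_text(rendered_html, "Average List Price"))
```

Against the real page you would pass `house_response.text` from the question as the first argument; if the result is `False`, the content is being filled in client-side and has to be fetched from wherever the script gets it.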