I fetched a page's HTML source and parsed it with BeautifulSoup using the 'html5lib' parser.
The result looks something like this:
<div jsaction="mouseover:pane.wfvdle40;mouseout:pane.wfvdle40" jsan="7.V0h1Ob-haAclf,7.OPZbO-KE6vqe,7.o0s21d-HiaYvf,0.jsaction" jstcache="824">
<a aria-label="Muzeum Londynu" href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&hl=pl&rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>
<div jstcache="826"></div>
<div aria-label="Muzeum Londynu" jsan="7.MVVflb-haAclf,7.V0h1Ob-haAclf-d6wfac,7.MVVflb-haAclf-uxVfW-hSRGPd,0.aria-label" jstcache="827">
<div jstcache="828"></div>
<div >
<div jstcache="829">
<div jsan="t-pdDsP4P8DQQ,7.RnEfrd-jRmmHf-HSrbLb,7.B9Hcub-QFlW2" jstcache="933">
<button jstcache="842" style="display:none"></button>
<div jsan="7.Z8fK3b,t-MjeqqY5XOdM" jstcache="843">
<div > <div >
<div jsan="7.qBF1Pd,7.gm2-subtitle-alt-1,t-u3p6PfXaXm4" jstcache="845">
<span jstcache="858">Muzeum Londynu</span>
</div>
<h1 jstcache="846" style="display:none"></h1>
<span ></span>
</div>
<div jstcache="847"></div>
<div jsan="7.ZY2y6b-RWgCYc,t-hEqDOx2FFV0" jstcache="848">
<div >
<span jstcache="860"></span>
<span jsan="t-CJ3Gw1VPbAA,7.gm2-body-2" jstcache="861">
<span jstcache="868" style="display:none"></span>
<span aria-label=" 4,6-gwiazdkowy Opinie (13 898) " jsan="7.ZkP5Je,0.aria-label,0.role,t-kqtGnPs-9G0" jstcache="869" role="group">
<span aria-hidden="true" jsan="7.MW4etd,0.aria-hidden" jstcache="872">4,6</span>
<div jstcache="873" style="display:none"></div>
<div jsan="7.QBUL8c" jsinstance="0" jstcache="874"></div>
<div jsan="7.QBUL8c" jsinstance="1" jstcache="874"></div>
<div jsan="7.QBUL8c" jsinstance="2" jstcache="874"></div>
<div jsan="7.QBUL8c" jsinstance="3" jstcache="874"></div>
<div jsan="7.QBUL8c,7.cXOKEb-S62Q7b" jsinstance="*4" jstcache="874"></div>
<span aria-hidden="true" jsan="7.UY7F9,0.aria-hidden" jstcache="875">(13 898)</span>
</span>
</span>
<span jstcache="862" style="display:none">
<jsl jstcache="863" style="display:none"></jsl>
</span>
</div>
</div>
<div >
<span jstcache="849" style="display:none"></span>
<div jsinstance="0" jstcache="850">
<span jsinstance="0" jstcache="851">
<jsl jstcache="852"> <span jstcache="884" style="display:none">·</span>
<span jstcache="885">Muzeum</span> <span jstcache="886" style="display:none"></span> </jsl> </span><span jsinstance="*1" jstcache="851"> <jsl jstcache="852"> <span aria-hidden="true" jsan="7.bXlT7b-hgDUwe,0.aria-hidden" jstcache="884">·</span> <span jstcache="885">150 London Wall</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div jsinstance="1" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Historia Londynu od starożytności do dziś</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div jsinstance="*2" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Zamknięcie: 17:00</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div> </div> </div> </div></div></div><div jstcache="830"></div><div jstcache="831"><div jsan="t-PLs0ILPSy_c,7.xwpmRb,7.qisNDe,5.width,5.height,5.margin-top,5.margin-bottom,5.margin-left,5.margin-right" jstcache="932" style="width: 84px; height: 84px; margin: 0px;"><div jsan="7.p0Hhde,7.Vig8jf-haAclf,5.min-width,5.min-height" jstcache="836" style="min-width:84px;min-height:84px"><img aria-hidden="true" decoding="async" src="//lh5.googleusercontent.com/proxy/tWfK1sqsGJZNlZu3WTUika5NJAu4mqKhx07Kub2ZjC_yU3PdIv3DWCKe8_cwJ3RBAUHjW5qZp3S6vGLQJ7HnYxCL_4YR4X1T3ju-ISh86JeC5Kb0KGnvp8j8Jt0vvk6Es_gdVz1AyfBfMDSN6DImwkgbwPL0RQ=w138-h92-k-no" style="position: absolute; top: 50%;left: 50%;width: 126px;height: 84px;-webkit-transform: translateY(-50%) translateX(-50%);transform: translateY(-50%) translateX(-50%);"/></div><button jstcache="837" style="display:none"></button><div ></div></div></div><div jstcache="832"></div></div><div jstcache="833"></div></div></div>
The last step was calling .find_all('a', href=True), which returned something like this:
[<a aria-label="Muzeum Londynu" href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&hl=pl&rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>]
I am trying to extract the latitude and longitude, [51.5176183, -0.0967782], which appear in the href.
I tried using .href the same way as .text, but .href returns None. Could you tell me how to extract those two values from the href?
Calling .text on the HTML returns output like this:
Museum of London 4,6(13 898) · Museum · 150 London Wall · The history of London from antiquity to today · Closing: 17:00
CodePudding user response:
Based on your question, I used the split() method to get the desired output.
Script
html='''
<html>
<head>
</head>
<body>
<a aria-label="Muzeum Londynu" href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&hl=pl&rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825">
</a>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'html5lib')
#print(soup.prettify())
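# The sample <a> tag has no class attribute, so select it by its href instead.
href = soup.find("a", href=True).get("href")
# Keep the last path segment, drop the query string, and take the last two '!' fields.
fields = href.split('/')[-1].split('?')[0].split('!')[-2:]
# Strip the two-character '3d' (latitude) and '4d' (longitude) prefixes.
lat_lan = [field[2:] for field in fields]
print(lat_lan)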
Output
['51.5176183', '-0.0967782']
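If the chained split() calls feel fragile, a regular expression may be easier to read. The following is only a sketch: extract_lat_lng and LAT_LNG_PATTERN are hypothetical names of my own, and the assumption that "!3d" marks the latitude and "!4d" the longitude comes from the sample link in the question, not from any documented URL format.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: pulls the "!3d<lat>!4d<lng>"
# pair out of a Google Maps place URL. The meaning of the 3d/4d markers is
# inferred from the sample link in the question, not from documentation.
LAT_LNG_PATTERN = re.compile(r"!3d(-?\d+(?:\.\d+)?)!4d(-?\d+(?:\.\d+)?)")


def extract_lat_lng(href):
    """Return (latitude, longitude) as floats, or None if the pattern is absent."""
    match = LAT_LNG_PATTERN.search(href)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# The href from the question, split across lines only for readability.
href = ("https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!"
        "1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782"
        "?authuser=0&hl=pl&rclk=1")
print(extract_lat_lng(href))  # (51.5176183, -0.0967782)
```

To feed it a parsed tag, pass tag.get('href') or tag['href']. In BeautifulSoup, tag.href does not read the attribute; it looks for a child tag named href, which is why it returns None.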