Text classification from html with BeautifulSoup

Time:12-09

I have obtained html page source code and then parsed it using 'html5lib' with BeautifulSoup.

I have got something like this:

<div  jsaction="mouseover:pane.wfvdle40;mouseout:pane.wfvdle40" jsan="7.V0h1Ob-haAclf,7.OPZbO-KE6vqe,7.o0s21d-HiaYvf,0.jsaction" jstcache="824">
    <a aria-label="Muzeum Londynu"  href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>
    <div  jstcache="826"></div>
    <div aria-label="Muzeum Londynu"  jsan="7.MVVflb-haAclf,7.V0h1Ob-haAclf-d6wfac,7.MVVflb-haAclf-uxVfW-hSRGPd,0.aria-label" jstcache="827">
        <div  jstcache="828"></div>
        <div >
            <div  jstcache="829">
                <div  jsan="t-pdDsP4P8DQQ,7.RnEfrd-jRmmHf-HSrbLb,7.B9Hcub-QFlW2" jstcache="933">
                    <button jstcache="842" style="display:none"></button>
                    <div  jsan="7.Z8fK3b,t-MjeqqY5XOdM" jstcache="843"> 
                        <div > <div >
                            <div  jsan="7.qBF1Pd,7.gm2-subtitle-alt-1,t-u3p6PfXaXm4" jstcache="845">
                                <span jstcache="858">Muzeum Londynu</span> 
                            </div>
                            <h1 jstcache="846" style="display:none"></h1> 
                            <span ></span> 
                        </div> 
                        <div  jstcache="847"></div> 
                        <div  jsan="7.ZY2y6b-RWgCYc,t-hEqDOx2FFV0" jstcache="848"> 
                        <div > 
                            <span  jstcache="860"></span>
                            <span  jsan="t-CJ3Gw1VPbAA,7.gm2-body-2" jstcache="861">
                            <span jstcache="868" style="display:none"></span>
                            <span aria-label=" 4,6-gwiazdkowy  Opinie (13 898)  "  jsan="7.ZkP5Je,0.aria-label,0.role,t-kqtGnPs-9G0" jstcache="869" role="group">
                            <span aria-hidden="true"  jsan="7.MW4etd,0.aria-hidden" jstcache="872">4,6</span>
                            <div jstcache="873" style="display:none"></div>
                            <div  jsan="7.QBUL8c" jsinstance="0" jstcache="874"></div>
                            <div  jsan="7.QBUL8c" jsinstance="1" jstcache="874"></div>
                            <div  jsan="7.QBUL8c" jsinstance="2" jstcache="874"></div>
                            <div  jsan="7.QBUL8c" jsinstance="3" jstcache="874"></div>
                            <div  jsan="7.QBUL8c,7.cXOKEb-S62Q7b" jsinstance="*4" jstcache="874"></div> 
                            <span aria-hidden="true"  jsan="7.UY7F9,0.aria-hidden" jstcache="875">(13 898)</span>
                        </span>
                     </span> 
                     <span jstcache="862" style="display:none">
                         <jsl jstcache="863" style="display:none"></jsl> 
                     </span> 
                 </div> 
             </div> 
             <div > 
                 <span jstcache="849" style="display:none"></span> 
                 <div  jsinstance="0" jstcache="850"> 
                     <span jsinstance="0" jstcache="851">
                          <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> 
                          <span jstcache="885">Muzeum</span> <span jstcache="886" style="display:none"></span> </jsl> </span><span jsinstance="*1" jstcache="851"> <jsl jstcache="852"> <span aria-hidden="true"  jsan="7.bXlT7b-hgDUwe,0.aria-hidden" jstcache="884">·</span> <span jstcache="885">150 London Wall</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div  jsinstance="1" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Historia Londynu od starożytności do dziś</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div><div  jsinstance="*2" jstcache="850"> <span jsinstance="*0" jstcache="851"> <jsl jstcache="852"> <span jstcache="884" style="display:none">·</span> <span jstcache="885">Zamknięcie: 17:00</span> <span jstcache="886" style="display:none"></span> </jsl> </span> </div> </div> </div> </div></div></div><div  jstcache="830"></div><div  jstcache="831"><div  jsan="t-PLs0ILPSy_c,7.xwpmRb,7.qisNDe,5.width,5.height,5.margin-top,5.margin-bottom,5.margin-left,5.margin-right" jstcache="932" style="width: 84px; height: 84px; margin: 0px;"><div  jsan="7.p0Hhde,7.Vig8jf-haAclf,5.min-width,5.min-height" jstcache="836" style="min-width:84px;min-height:84px"><img aria-hidden="true" decoding="async" src="//lh5.googleusercontent.com/proxy/tWfK1sqsGJZNlZu3WTUika5NJAu4mqKhx07Kub2ZjC_yU3PdIv3DWCKe8_cwJ3RBAUHjW5qZp3S6vGLQJ7HnYxCL_4YR4X1T3ju-ISh86JeC5Kb0KGnvp8j8Jt0vvk6Es_gdVz1AyfBfMDSN6DImwkgbwPL0RQ=w138-h92-k-no" style="position: absolute; top: 50%;left: 50%;width: 126px;height: 84px;-webkit-transform: translateY(-50%) translateX(-50%);transform: translateY(-50%) translateX(-50%);"/></div><button jstcache="837" style="display:none"></button><div ></div></div></div><div  jstcache="832"></div></div><div  jstcache="833"></div></div></div>

The last step was calling .find_all('a', href=True), which returned something like this:

[<a aria-label="Muzeum Londynu"  href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825"></a>]

I am trying to extract the latitude and longitude, [51.5176183, -0.0967782], which are present in the href.

I've tried using .href in the same way as .text, but .href returns None. Could you tell me how to extract those two values from the href?

Calling .text on the HTML above returns output like this:

Museum of London         4,6(13 898)           · Museum     · 150 London Wall       · The history of London from antiquity to today       · Closing: 17:00      

CodePudding user response:

Based on your question: you can read the attribute with .get('href') and then use split() to pull the two values out of the URL.
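As for why .href returned None: BeautifulSoup treats attribute-style access on a tag as a search for a child tag with that name, so tag.href looks for a nested <href> element rather than the href attribute. Attributes are read with tag['href'] or tag.get('href'). A minimal sketch of the difference, assuming a shortened stand-in tag with a placeholder URL and the built-in 'html.parser' (both chosen for illustration, not taken from the question):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the anchor tag in the question.
snippet = '<a aria-label="Muzeum Londynu" href="https://example.com/place?x=1"></a>'
tag = BeautifulSoup(snippet, "html.parser").find("a", href=True)

child = tag.href          # searches for a child <href> tag, so this is None
by_index = tag["href"]    # attribute lookup; raises KeyError if href is missing
by_get = tag.get("href")  # attribute lookup; returns None if href is missing

print(child)
print(by_index == by_get)
```

Because .find_all() returns a list, the same lookup on its result would be applied to each element, for example results[0].get('href').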

script

html='''
<html>
 <head>
 </head>
 <body>
  <a aria-label="Muzeum Londynu"  href="https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782?authuser=0&amp;hl=pl&amp;rclk=1" jsaction="pane.wfvdle40;focus:pane.wfvdle40;blur:pane.wfvdle40;auxclick:pane.wfvdle40;contextmenu:pane.wfvdle40;keydown:pane.wfvdle40;clickmod:pane.wfvdle40" jsan="7.a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd,0.aria-label,8.href,0.jsaction" jstcache="825">
  </a>
 </body>
</html>
'''
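from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')
#print(soup.prettify())
# The sample <a> tag has no class attribute, so select it by the presence of href.
href = soup.find('a', href=True).get('href')
# Take the last path segment, drop the query string, and keep the final two
# '!'-separated fields ('3d<lat>' and '4d<lng>'), stripping the 2-char prefix.
lat_lan = [part[2:] for part in href.split('/')[-1].split('?')[0].split('!')[-2:]]
print(lat_lan)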

Output

['51.5176183', '-0.0967782']
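
A chain of split() calls depends on the coordinates being the last two fields of the URL. A regular expression is a possible alternative that is less sensitive to position. The sketch below assumes the coordinates always appear as "!3d<latitude>!4d<longitude>", which is only inferred from the sample URL in the question and is not a documented format; the function name extract_lat_lng is made up for this example.

```python
import re


def extract_lat_lng(href):
    """Return (lat, lng) as floats, or None if the pattern is not found.

    Assumes the '!3d<lat>!4d<lng>' layout seen in the sample URL.
    """
    match = re.search(r"!3d(-?\d+(?:\.\d+)?)!4d(-?\d+(?:\.\d+)?)", href)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# The href value from the question, split across lines for readability.
href = (
    "https://www.google.com/maps/place/Muzeum Londynu/data=!4m5!3m4!"
    "1s0x48761b5508c1cbeb:0x407de2c1952a25e4!8m2!3d51.5176183!4d-0.0967782"
    "?authuser=0&hl=pl&rclk=1"
)

print(extract_lat_lng(href))  # (51.5176183, -0.0967782)
```

Returning None when nothing matches lets the caller skip links that carry no coordinates instead of failing with an IndexError.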