Using BeautifulSoup to pull multiple kml files

Time:03-16

I'm learning Python and trying to automate a process that involves going to a site (wildcad net), clicking through every dispatch center, and loading the KML file linked from each one. I noticed that each page follows a similar format:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
    "http://www.w3.org/TR/html4/frameset.dtd">
    <head>
     <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
     <meta content="WildCAD (Brian Booher)" name="GENERATOR"/>
     <title>
      WCAZ-ADC
     </title>
    </head>
    <frameset rows="64,*">
     <frame name="banner" noresize="" scrolling="no" src="WCAZ-ADCtop.htm"/>
     <frameset cols="150,*">
      <frame name="contents" src="WCAZ-ADCleft.htm"/>
      <frame name="main" src="WCAZ-ADCright.htm"/>
     </frameset>
     <noframes>
      <body>
       <p>  
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.kml" target="map"><font size="1">Incident Map (Google Earth)</font></a>

       </p>
      </body>
     </noframes>
    </frameset>

I figured I could use BeautifulSoup to select all the links. I listed the URL of each dispatch center and then searched for the 'a' tags and their 'href' attributes, since these shouldn't change. The code I wrote is below. However, it doesn't seem to pick up the KML link as its own variable. I'm not sure where I went wrong, and I'm a bit lost on how to troubleshoot the next steps. Any pointers in the right direction would be a great help!

from bs4 import BeautifulSoup
    import requests

    urls = ('http://www.wildcad.net/WCAZ-ADC.htm', 'http://www.wildcad.net/WCALAIC.htm',
           'http://www.wildcad.net/WCAR-AOC.htm','http://www.wildcad.net/WCAZ-ADC.htm'
           'http://www.wildcad.net/WCAZ-FDC.htm', 'http://www.wildcad.net/WCAZ-PDC.htm'
           'http://www.wildcad.net/WCAZ-PHC.htm', 'http://www.wildcad.net/WCAZ-SDC.htm'
           'http://www.wildcad.net/WCAZ-TDC.htm', 'http://www.wildcad.net/WCAZ-WDC.htm'
           'http://www.wildcad.net/WCBLMNOC.htm', 'http://www.wildcad.net/WCCA-ANF.htm'
           'http://www.wildcad.net/WCCA-ANF.htm', 'http://www.wildcad.net/WCCA-CNF.htm'
           'http://www.wildcad.net/WCCA-FICC.htm', 'http://www.wildcad.net/WCCA-GVCC.htm'
           'http://www.wildcad.net/WCCA-MICC.htm', 'http://www.wildcad.net/WCCA-ONCC.htm'
           'http://www.wildcad.net/WCCA-OVICC.htm', 'http://www.wildcad.net/WCCA-PNF.htm'
           'http://www.wildcad.net/WCCA-SNF.htm' , 'http://www.wildcad.net/WCCA-STF.htm'
           'http://www.wildcad.net/WCCA-YICC.htm', 'http://www.wildcad.net/WCCA-YNP.htm'
           'http://www.wildcad.net/WCCALPF.htm' , 'http://www.wildcad.net/WCCAMNF.htm'
           'http://www.wildcad.net/WCCANCIC.htm' , 'http://www.wildcad.net/WCCARICC.htm'
           'http://www.wildcad.net/WCCASQCC.htm', 'http://www.wildcad.net/WCCCICC.htm'
           'http://www.wildcad.net/WCCO-CRC.htm' , 'http://www.wildcad.net/WCCO-FTC.htm'
           'http://www.wildcad.net/WCCO-GJC.htm' , 'http://www.wildcad.net/WCCO-MTC.htm'
           'http://www.wildcad.net/WCCODRC.htm' , 'http://www.wildcad.net/WCCOPBC.htm'
           'http://www.wildcad.net/WCFL-FIC.htm' , 'http://www.wildcad.net/WCGAGIC.htm'
           'http://www.wildcad.net/WCID-CDC.htm' , 'http://www.wildcad.net/WCID-GVC.htm'
           'http://www.wildcad.net/WCID-SCC.htm', 'http://www.wildcad.net/WCIDBDC.htm'
           'http://www.wildcad.net/WCIDCIC.htm', 'http://www.wildcad.net/WCIDEIC.htm'
           'http://www.wildcad.net/WCIDPAC.htm' , 'http://www.wildcad.net/WCILILC.htm'
           'http://www.wildcad.net/WCIN-IIC.htm', 'http://www.wildcad.net/WCKY-KIC.htm'
           'http://www.wildcad.net/WCLALIC.htm', 'http://www.wildcad.net/WCMI-MIDC.htm'
           'http://www.wildcad.net/WCMN-MNCC.htm', 'http://www.wildcad.net/WCMOMOC.htm'
           'http://www.wildcad.net/WCMSMIC.htm', 'http://www.wildcad.net/WCMT-BRC.htm'
           'http://www.wildcad.net/WCMT-BZC.htm', 'http://www.wildcad.net/WCMT-DDC.htm'
           'http://www.wildcad.net/WCMT-GDC.htm' 'http://www.wildcad.net/WCMT-HDC.htm'
           'http://www.wildcad.net/WCMT-KDC.htm', 'http://www.wildcad.net/WCMT-KIC.htm'
           'http://www.wildcad.net/WCMT-LEC.htm' , 'http://www.wildcad.net/WCMT-MCC.htm'
           'http://www.wildcad.net/WCMT-MDC.htm', 'http://www.wildcad.net/WCNC-NCC.htm'
           'http://www.wildcad.net/WCNDNDC.htm' , 'http://www.wildcad.net/WCNH-NEC.htm'
           'http://www.wildcad.net/WCNM-ABC.htm' , 'http://www.wildcad.net/WCNM-ADC.htm'
           'http://www.wildcad.net/WCNM-SDC.htm', 'http://www.wildcad.net/?WildWeb=NM-SFC'
           'http://www.wildcad.net/WCNMTDC.htm', 'http://www.wildcad.net/WCNMTDC.htm'
           'http://www.wildcad.net/WCNVCNC.htm' , 'http://www.wildcad.net/WCNVECC.htm'
           'http://www.wildcad.net/WCNVEIC.htm' , 'http://www.wildcad.net/WCNVLIC.htm'
           'http://www.wildcad.net/WCNVSFC.htm', 'http://www.wildcad.net/WCOR-BIC.htm'
           'http://www.wildcad.net/WCOR-COC.htm', 'http://www.wildcad.net/WCOR-EIC.htm'
           'http://www.wildcad.net/WCOR-JDCC.htm', 'http://www.wildcad.net/WCOR-RICC.htm'
           'http://www.wildcad.net/WCOR-RVC.htm', 'http://www.wildcad.net/WCOR-VAC.htm'
           'http://www.wildcad.net/WCORBMC.htm', 'http://www.wildcad.net/WCORLFC.htm'
           'http://www.wildcad.net/WCPA-MACC.htm', 'http://www.wildcad.net/WCSC-SCC.htm'
           'http://www.wildcad.net/WCSC-SRF.htm', 'http://www.wildcad.net/WCSD-GPC.htm'
           'http://www.wildcad.net/WCTN-TNC.htm', 'http://www.wildcad.net/WCTXTIC.htm'
           'http://www.wildcad.net/WCUT-CDC.htm' , 'http://www.wildcad.net/WCUT-MFC.htm'
           'http://www.wildcad.net/WCUT-NUC.htm' , 'http://www.wildcad.net/WCUT-RFC.htm'
           'http://www.wildcad.net/WCUT-UBC.htm' , 'http://www.wildcad.net/WCVAVIC.htm'
           'http://www.wildcad.net/WCWA-CWC.htm', 'http://www.wildcad.net/WCWY-CDC.htm'
           'http://www.wildcad.net/WCWY-CPC.htm',
    result = requests.get(urls)
    doc = BeautifulSoup(result.text, 'html.parser')
    print(doc.prettify())
    for i in enumerate(soup.findAll('a')):
        _KML = urls   link.get('href')
        if _KML.endswith('.kml'):
            urls.append(_KML)

    open(_KML)
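
A note on the URL tuple above, which may explain part of the problem: most lines end without a comma, and Python silently joins adjacent string literals into one string, so the tuple holds far fewer (and much longer) entries than intended. In addition, the tuple is never closed, requests.get() takes a single URL rather than a tuple, tuples have no append(), and the loop uses the names soup and link, which are never defined. The short sketch below uses a hypothetical three-entry excerpt to show the missing-comma effect; the names broken and fixed are illustrative only.

```python
# Hypothetical three-entry excerpt; note the missing comma after the second URL.
broken = (
    'http://www.wildcad.net/WCAZ-ADC.htm',
    'http://www.wildcad.net/WCAZ-FDC.htm'   # <- no comma: joined with the next literal
    'http://www.wildcad.net/WCAZ-PDC.htm',
)

# Same entries with every comma in place, stored in a list so append() is available.
fixed = [
    'http://www.wildcad.net/WCAZ-ADC.htm',
    'http://www.wildcad.net/WCAZ-FDC.htm',
    'http://www.wildcad.net/WCAZ-PDC.htm',
]

print(len(broken))  # 2
print(len(fixed))   # 3
print(broken[1])    # the second and third URLs run together as one string
```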

CodePudding user response:

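If I understand the question correctly, the following is a working example. It parses the HTML you posted (with one extra non-KML link added so the filter has something to reject) and prints only the hrefs that end in .kml.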

doc='''
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
    "http://www.w3.org/TR/html4/frameset.dtd">
    <head>
     <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
     <meta content="WildCAD (Brian Booher)" name="GENERATOR"/>
     <title>
      WCAZ-ADC
     </title>
    </head>
    <frameset rows="64,*">
     <frame name="banner" noresize="" scrolling="no" src="WCAZ-ADCtop.htm"/>
     <frameset cols="150,*">
      <frame name="contents" src="WCAZ-ADCleft.htm"/>
      <frame name="main" src="WCAZ-ADCright.htm"/>
     </frameset>
     <noframes>
      <body>
       <p>  
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.kml" target="map"><font size="1">Incident Map (Google Earth)</font></a>
    <a href="http://www.wildcadmap.net/WildCAD_AZ-FDC.htm" target="map"><font size="1">Incident Map (Google Earth)</font></a>
       </p>
      </body>
     </noframes>
    </frameset>

'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(doc, 'html.parser')
# href=True skips <a> tags that have no href attribute
for link in soup.find_all('a', href=True):
    href = link['href']
    # keep only the links that point at a .kml file
    if href.endswith('.kml'):
        kml = href
        print(kml)

Output:

http://www.wildcadmap.net/WildCAD_AZ-FDC.kml
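
To apply the same filter to many dispatch pages, one option is to fetch each page separately and collect the matches. This is a minimal sketch rather than a tested scraper for the real site: the helper names extract_kml_links, fetch_html and collect_kml_links are made up for illustration, the 30-second timeout is an arbitrary choice, and it assumes every page exposes its KML link in an 'a' tag the way the posted sample does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_kml_links(html, base_url=""):
    """Return every href in html that ends with .kml, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags with no href, so the lookup below cannot fail
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        if href.lower().endswith(".kml"):
            # urljoin leaves absolute URLs alone and completes relative ones
            links.append(urljoin(base_url, href))
    return links


def fetch_html(url):
    """Download one page; requests.get takes a single URL, not a tuple of them."""
    import requests  # imported here so the parsing helper works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def collect_kml_links(page_urls, fetch=fetch_html):
    """Visit each page in turn and gather the unique KML links, in order."""
    found = []
    for page_url in page_urls:
        for link in extract_kml_links(fetch(page_url), base_url=page_url):
            if link not in found:
                found.append(link)
    return found
```

Calling collect_kml_links(urls) with a correctly comma-separated list of page URLs would return the KML addresses, and each of those can then be downloaded with another requests.get call. Passing fetch as a parameter is only there so the parsing can be tried on saved HTML without touching the network. If a page returns no matches, the link may live in one of the frame documents (for example the src of the "contents" frame) instead of the noframes fallback; that is a guess based on the posted markup, not something confirmed against the site.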