Python - Get a list of URLs from a complicated HTML file for scraping purposes

Time:12-16

I am new to web scraping and could not get the list of URLs in the 'a' tags from this website: http://www.tauntondevelopment.org//msip/JHRindex.htm. All I get is an empty list: clients list: []. Thank you for your help!

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# This is the url of one major industrial park that we will be scraping
park_url = "http://www.tauntondevelopment.org//msip/JHRindex.htm"

uPark = uReq(park_url)
park_html = uPark.read()
uPark.close()

park_soup = soup(park_html, "html.parser")

filename = "ParkText.html"
f = open(filename, "w") 
f.write(park_soup.prettify())
f.close()

# get a list of the urls of park_url    
clients_list = []
for link in park_soup.findAll('li'):
    clients_list.append(link.get('href'))

print("clients list:", clients_list)

# write clients to a file 
filename = "taunton_JHR.csv"

f = open(filename, "w") # 
headers = "Name, Email, Address\n"
f.write(headers)
 
for client_url in clients_list:
    # call the function to scrape the individual park data
    client_url = "http://www.tauntondevelopment.org/msip/" + client_url
    try: 
        uClient = uReq(client_url)
    except:
        print("Error: Unable to open url")
        continue # continue to the next client_url in the list

    client_name, client_email, client_address = scrapeIndPark(uClient)
    
    f.write(client_name + "," + client_email + "," + client_address + "\n")
    
f.close()


CodePudding user response:

Did you look at the HTML you are downloading?

<html>
 <head>
  <title>
   John Hancock Road
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 </head>
 <frameset bordercolor="#E0E0E0" cols="25%,591*">
  <frame name="index" src="JHRleft.htm" target="content"/>
  <frame name="content" src="JHRright.htm"/>
 </frameset>
 <noframes>
  <body bgcolor="#FFFFFF">
  </body>
 </noframes>
 <frameset>
 </frameset>
</html>

Notice that (at least in my case) it's empty! That's because the page is built with frames. To access a frame, open the page in your browser, open the developer tools, go to the Network tab, and see which URLs the later requests (the ones that fill the frames with data) are sent to. In this case the URL you are looking for is probably http://www.tauntondevelopment.org//msip/JHRleft.htm
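The frame URLs can also be read from the frameset itself instead of the network inspector. Below is a minimal sketch of that idea, assuming the `src` values are relative to the frameset page's URL. The helper name `frame_urls` is made up for illustration, and the sample HTML is a trimmed copy of the frameset above so the snippet runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def frame_urls(html, base_url):
    """Return the absolute URL of every <frame> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips frames that carry no src attribute at all.
    return [urljoin(base_url, frame["src"])
            for frame in soup.find_all("frame", src=True)]


# Trimmed copy of the frameset shown above.
sample = """
<html><frameset cols="25%,591*">
  <frame name="index" src="JHRleft.htm" target="content"/>
  <frame name="content" src="JHRright.htm"/>
</frameset></html>
"""
base = "http://www.tauntondevelopment.org//msip/JHRindex.htm"
urls = frame_urls(sample, base)
print(urls)
```

Each URL in `urls` can then be downloaded and parsed the same way as the original page.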

CodePudding user response:

In your code you're trying to get the href attribute from the li elements themselves. Actually, each li element has a nested p, which has a nested b, which has a nested a inside; you need to get that nested a.

Here is a suggestion:

clients_list = []
for link in park_soup.findAll('li'):
    # findAll returns a list-like ResultSet, so index into it before calling .get
    href_attr = link.findAll('p')[0].findAll('b')[0].findAll('a')[0].get('href')
    clients_list.append(href_attr)

Another idea would be to go straight for all the a tags:

clients_list = []
for link in park_soup.findAll('a'):
    href_attr = link.get('href')
    clients_list.append(href_attr)
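
One caveat with the loop above: `link.get('href')` returns `None` for an a tag that has no href, and relative links still need the site prefix. The following is a rough sketch of one way to handle both, using `find_all('a', href=True)` and `urljoin`. The function name `extract_links` and the sample markup are hypothetical and not copied from the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Collect absolute URLs from every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True filters out anchors without an href, so no None values appear.
    return [urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)]


# Hypothetical markup for illustration only.
sample = """
<ul>
  <li><p><b><a href="client1.htm">Client One</a></b></p></li>
  <li><p><b><a name="top">No link here</a></b></p></li>
  <li><p><b><a href="http://example.com/">External</a></b></p></li>
</ul>
"""
links = extract_links(sample, "http://www.tauntondevelopment.org/msip/JHRleft.htm")
print(links)
```

With `urljoin`, links that are already absolute pass through unchanged, so the manual string concatenation in the question would no longer be needed.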