How to web-scrape an old-school website that uses frames


I am trying to web-scrape a government site that uses a frameset. Here is the URL: https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm

I've tried using splinter/Selenium:

import time
from splinter import Browser

browser = Browser()

url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)

# full XPath to the second frame of the nested frameset
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)

for i in tree:
    print(i.text)

It just returns an empty string, since the XPath matches the <frame> element itself rather than the document loaded inside it.
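For reference, the usual fix in plain Selenium is to switch the driver into the frame before querying it; a minimal sketch, assuming the frame can be addressed by the name='reports' it has in the frameset:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm")

# the top-level document only contains the <frameset>; switch the
# driver into the frame named 'reports' to reach its contents
driver.switch_to.frame("reports")
print(driver.find_element(By.TAG_NAME, "body").text)

# return to the top-level document when done
driver.switch_to.default_content()

I've tried using the requests library.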


import requests
from lxml import html

url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"

# get response object
response = requests.get(url)

# get byte string
data = response.content
print(data)

And it returns this:

b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%, *'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"

I've also tried using Beautiful Soup and it gave me the same thing. Is there another Python library I can use to get the data that's inside the second table?

Thank you for any feedback.

CodePudding user response:

As mentioned, you could go for the frames and their src attributes:

BeautifulSoup(r.text, 'html.parser').select('frame')[1].get('src')
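To actually fetch a frame's document, resolve that relative src against the frameset page's URL, e.g. with urllib.parse.urljoin. A minimal sketch (index 2 picks the 'reports' frame holding the results; index 1 would be menu.htm as above, and the variable names are illustrative):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm'
r = requests.get(url)

# frames appear in document order: titlebar.htm, menu.htm, Lake_ElecSumm_all.htm
src = BeautifulSoup(r.text, 'html.parser').select('frame')[2].get('src')

# the src values are relative, so resolve them against the page URL before fetching
frame_url = urljoin(url, src)
frame_html = requests.get(frame_url).text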

or go directly to menu.htm:

import requests
from bs4 import BeautifulSoup

base = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/'
r = requests.get(base + 'menu.htm')

# hrefs in menu.htm are relative, so prefix them with the base URL
link_list = [base + a.get('href') for a in BeautifulSoup(r.text, 'html.parser').select('a')]

for link in link_list[:1]:   # [:1] limits the demo to the first link; drop it to crawl all
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    ###...scrape what is needed
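
From there, since the goal is the tabular results, pandas.read_html can parse every <table> on a fetched page straight into DataFrames. A hedged sketch, assuming the result pages are plain HTML tables (r here is the response from the loop above):

from io import StringIO
import pandas as pd

# read_html returns one DataFrame per <table> found in the HTML;
# wrapping in StringIO keeps newer pandas from treating the string as a path
tables = pd.read_html(StringIO(r.text))
print(len(tables), 'tables found')
print(tables[0].head())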