How to access #document frames in BeautifulSoup on a page that uses the Microsoft Excel schema?

Time:10-01

As the title says, I'm scraping a website that lists schools. Clicking an entry redirects you to another .htm page that uses the urn:schemas-microsoft-com:office:excel XML namespace.

All I want is each school's name, email and website, which I believe I can extract on my own and will later export to a CSV file. The problem is that I cannot access the table by any means; every attempt gives me None as the output.

The main website: https://myschoolchildren.com/list-of-all-secondary-schools-in-malaysia/#.YzWrtXZBy3A
First link on that website: https://myschoolchildren.com/data/SEK_MEN_Johor.htm

Here's my work on it so far (entire code has been shared):

import requests
from bs4 import BeautifulSoup


def write(file_name, data_type):
    with open(file_name, "a") as requirement:
        requirement.write("%s\n" % data_type)


def url_parser(url):
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def lxml_url_parser(url):
    html_doc = requests.get(url)
    soup = BeautifulSoup(html_doc.text, 'lxml')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    links = soup.find(class_='entry-content').find_all('a')
    for link in links:
        web = link.get('href')
        soup2 = lxml_url_parser(web)
        #school_name = soup2.find('tbody').find_all('tr')
        print(soup2)
        #print(school_name)
        break


def main():
    url = "https://myschoolchildren.com/list-of-all-secondary-schools-in-malaysia/#.YzWrtXZBy3A"
    data_fetch(url)


if __name__ == "__main__":
    main()

I have no idea where I'm going wrong. All I want is the name, email and website of each school. Any suggestions?

CodePudding user response:

In lxml_url_parser, try changing

html_doc = requests.get(url)

to

html_doc = requests.get(url.replace('.htm', '_files/sheet001.htm'))

The top-level .htm page only contains frames; the table itself is loaded from this sheet001.htm file when the page is displayed, so request that URL directly.
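
For illustration, here is a minimal sketch of that idea. The helper names sheet_url and table_rows are hypothetical (they are not from the question), and the sketch assumes every linked page follows the same Name.htm to Name_files/sheet001.htm layout as the Johor page, which I have not checked for the other links.

```python
from urllib.parse import urlsplit, urlunsplit


def sheet_url(frameset_url, sheet="sheet001.htm"):
    """Map .../Name.htm to .../Name_files/sheet001.htm (assumed layout)."""
    parts = urlsplit(frameset_url)
    if not parts.path.endswith(".htm"):
        raise ValueError("expected a URL ending in .htm: %s" % frameset_url)
    base = parts.path[: -len(".htm")]
    # Query string and fragment are dropped; only the path is rewritten.
    new_path = "%s_files/%s" % (base, sheet)
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


def table_rows(html):
    """Return every non-empty table row as a list of cell strings."""
    # Imported here so sheet_url stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if any(cells):  # skip rows whose cells are all empty
            rows.append(cells)
    return rows
```

Inside data_fetch you could then call table_rows(requests.get(sheet_url(web)).text) and print the first few rows. Which columns hold the name, email and website is something you need to check against that output; the sketch does not assume any column order.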
