How to scrape multiple pages in HTML table with same URL with Python?

I'm trying to scrape the job postings from the following public website:

https://newbraunfels.tedk12.com/hire/Index.aspx

I know there are a few similar questions here, but I've followed all of them and still can't figure it out, as my JavaScript/HTML skills are limited.

I can get the first page with no issues, but can't seem to access the following three pages.

My best attempt is the following, but it still only returns the first page of listings:

import requests
from bs4 import BeautifulSoup

url = "https://newbraunfels.tedk12.com/hire/Index.aspx"

soup = BeautifulSoup(requests.get(url).content, "html.parser")


def load_page(soup, page_num):
    payload = {
        "__EVENTTARGET": "",
        "__EVENTARGUMENT": "PageIndexNumber${}".format(page_num),
    }
    # copy every named form field (including hidden ones) into the payload
    for inp in soup.select("input[name]"):
        payload[inp["name"]] = inp.get("value")
    soup = BeautifulSoup(requests.post(url, data=payload).content, "lxml")
    
    return soup


# print job postings from the first page:
for jobs in soup.select("table"):
    print(jobs.text)

# load second page
soup = load_page(soup, 2)
for jobs in soup.select("table"):
    print(jobs.text)

Thank you in advance.

CodePudding user response:

An easier approach in this case might be to request each page directly using GET (query-string) parameters. The "StartIndex" parameter should be a multiple of 50, as 50 results are shown on each page. Just increment it by 50 for each page of results you want to scrape.

Page 1: https://newbraunfels.tedk12.com/hire/Index.aspx?JobListAJAX=Paging&StartIndex=0&ListID=JobList&SearchString=

Page 2: https://newbraunfels.tedk12.com/hire/Index.aspx?JobListAJAX=Paging&StartIndex=50&ListID=JobList&SearchString=

Page 3: https://newbraunfels.tedk12.com/hire/Index.aspx?JobListAJAX=Paging&StartIndex=100&ListID=JobList&SearchString=

...and so on.
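As a rough sketch of that pattern, the page URLs could be generated rather than typed out. The names `BASE_URL`, `PAGE_SIZE` and `page_url` are made up for illustration, and the count of four pages is taken from the question (the first page plus three more):

```python
from urllib.parse import urlencode

BASE_URL = "https://newbraunfels.tedk12.com/hire/Index.aspx"
PAGE_SIZE = 50  # results per page, as described above


def page_url(page_num, page_size=PAGE_SIZE):
    """Return the paging URL for a 1-based page number (hypothetical helper)."""
    params = {
        "JobListAJAX": "Paging",
        "StartIndex": (page_num - 1) * page_size,  # 0, 50, 100, ...
        "ListID": "JobList",
        "SearchString": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Assumption: the listing spans four pages, as mentioned in the question.
urls = [page_url(n) for n in range(1, 5)]
for u in urls:
    print(u)
```

If the number of pages is not known in advance, one option is to keep increasing StartIndex until a response comes back with no results.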

The response is XML, so you will also need to parse it with Beautiful Soup's XML parser so that you can target elements as usual. See here for an example:

https://linuxhint.com/parse_xml_python_beautifulsoup/
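A minimal sketch of fetching one page and loading it as XML might look like the following. It assumes `lxml` is installed (Beautiful Soup's "xml" parser depends on it). The helper names and the sample document are invented for illustration; the real element names in the response are not shown here and would need to be checked by inspecting the response first.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://newbraunfels.tedk12.com/hire/Index.aspx"


def parse_xml(xml_data):
    """Parse XML (str or bytes) with Beautiful Soup's XML parser."""
    return BeautifulSoup(xml_data, "xml")


def leaf_texts(soup):
    """Return (tag name, text) pairs for elements with no child elements."""
    return [
        (tag.name, tag.get_text(strip=True))
        for tag in soup.find_all(True)
        if tag.find(True) is None
    ]


def fetch_page(start_index):
    """Fetch one page of results using the query parameters from the URLs above."""
    params = {
        "JobListAJAX": "Paging",
        "StartIndex": start_index,
        "ListID": "JobList",
        "SearchString": "",
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return parse_xml(response.content)


# Made-up sample, only to show the parsing step without a network call.
sample = "<rows><row><title>Teacher</title><site>Campus A</site></row></rows>"
print(leaf_texts(parse_xml(sample)))
```

Calling `leaf_texts(fetch_page(50))` would then list the tag names and text values on the second page, which is a quick way to see which elements are worth targeting.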