I am trying to learn how to pull data from this URL: https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview
The problem is that the URL doesn't change when I switch pages, so I am not sure how to enumerate or loop through them. I'm looking for a better way, since the site has about three thousand sale records.
Here is my starting code. It is very simple, but I would appreciate any help or hints. I think I might need to switch to another package, but I am not sure which one. Maybe BeautifulSoup?
import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]  # first HTML table on the page
df_list = df_list.drop([0, 1, 2])  # drop unnecessary rows
CodePudding user response:
To get the data from more pages, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Form data the site sends when you switch pages (visible in the browser's
# network tab); "pageNum" selects which page of results is returned.
data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

all_data = []
for page in range(1, 3):  # <-- increase the number of pages here
    data["pageNum"] = page
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for row in soup.select("#searchResults tr")[2:]:  # skip the two header rows
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)
columns = [
    "SEQ NUM",
    "Tax Year",
    "Notices",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)

# print last 10 items from dataframe:
print(df.tail(10).to_markdown())
Prints:
|     | SEQ NUM | Tax Year | Notices | Parcel ID        | Face Amount | Winning Bid | Sold To  |
|----:|:--------|:---------|:--------|:-----------------|:------------|:------------|:---------|
|  96 | 000094  | 2020     |         | 00031-18-001-000 | $905.98     | $81.00      | 00005517 |
|  97 | 000095  | 2020     |         | 00031-18-002-000 | $750.13     | $75.00      | 00005517 |
|  98 | 000096  | 2020     |         | 00031-18-003-000 | $750.13     | $75.00      | 00005517 |
|  99 | 000097  | 2020     |         | 00031-18-004-000 | $750.13     | $75.00      | 00005517 |
| 100 | 000098  | 2020     |         | 00031-18-007-000 | $750.13     | $76.00      | 00005517 |
| 101 | 000099  | 2020     |         | 00031-18-008-000 | $905.98     | $84.00      | 00005517 |
| 102 | 000100  | 2020     |         | 00031-19-001-000 | $1,999.83   | $171.00     | 00005517 |
| 103 | 000101  | 2020     |         | 00031-19-004-000 | $1,486.49   | $131.00     | 00005517 |
| 104 | 000102  | 2020     |         | 00031-19-006-000 | $1,063.44   | $96.00      | 00005517 |
| 105 | 000103  | 2020     |         | 00031-20-001-000 | $1,468.47   | $126.00     | 00005517 |
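If you don't know the page count in advance, a variation is to keep posting until the server stops returning new rows. This is an untested sketch: the form fields and the #searchResults selector are copied from the code above, and the stop conditions assume the server either returns an empty table or repeats the last page when pageNum runs past the end.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

all_data = []
page = 1
seen_first = None
while True:
    data["pageNum"] = page
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    rows = soup.select("#searchResults tr")[2:]  # skip the two header rows
    if not rows:
        break  # empty table: we are past the last page
    first = [td.text.strip() for td in rows[0].select("td")]
    if first == seen_first:
        break  # server repeated the last page; stop
    seen_first = first
    for row in rows:
        all_data.append([td.text.strip() for td in row.select("td")])
    page += 1

columns = ["SEQ NUM", "Tax Year", "Notices", "Parcel ID",
           "Face Amount", "Winning Bid", "Sold To"]
df = pd.DataFrame(all_data, columns=columns)
df.to_csv("auction_results.csv", index=False)  # save everything for later use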
CodePudding user response:
Use the information wisely, and make sure you have permission to scrape this site and process the data.
If you press F12 on the site, go to the Network tab -> Payload, and switch to page two, form data shows up that includes the page number. Replicating this form and modifying the page value should let you scrape each page.
As always, there is probably a Python package out there that will make this easy.
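For example, here is a minimal sketch using requests (which is probably that package) combined with the pandas.read_html call from the question. The field names come from the DevTools Payload tab as described above; I've kept only a few of them for illustration, so the full set shown in the first answer may be required for the request to succeed.
import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
# Subset of the form data captured in DevTools; other fields may be needed.
form = {"folder": "auctionResults", "pageNum": "1", "doSearch": "true"}

pages = []
for page in range(1, 4):
    form["pageNum"] = str(page)  # only the page number changes between requests
    html = requests.post(url, data=form).content
    # same parsing as in the question, applied to each page's response
    pages.append(pd.read_html(html, header=1)[0].drop([0, 1, 2]))

df = pd.concat(pages, ignore_index=True)
print(len(df))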