I am trying to learn how to pull data from this URL: https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview
The problem is that the URL doesn't change when I switch pages, so I am not sure how to enumerate or loop through them. I'm looking for a better way, since the site has about three thousand sale records.
Here is my starting code. It is very simple, but I would appreciate any help or hints. I think I might need to switch to another package, but I am not sure which one. Maybe BeautifulSoup?
import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]  # first HTML table on the page
df_list = df_list.drop([0, 1, 2])  # drop unnecessary rows
CodePudding user response:
To get the data from more pages, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Form data the site sends when you switch pages (visible in the browser's
# network tab); "pageNum" selects which page of results is returned.
data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

all_data = []
for page in range(1, 3):  # <-- increase the number of pages here
    data["pageNum"] = page
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for row in soup.select("#searchResults tr")[2:]:  # skip the two header rows
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)
columns = [
    "SEQ NUM",
    "Tax Year",
    "Notices",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)

# print last 10 items from dataframe:
print(df.tail(10).to_markdown())
Prints:
|     | SEQ NUM | Tax Year | Notices | Parcel ID        | Face Amount | Winning Bid | Sold To  |
|----:|:--------|:---------|:--------|:-----------------|:------------|:------------|:---------|
|  96 | 000094  | 2020     |         | 00031-18-001-000 | $905.98     | $81.00      | 00005517 |
|  97 | 000095  | 2020     |         | 00031-18-002-000 | $750.13     | $75.00      | 00005517 |
|  98 | 000096  | 2020     |         | 00031-18-003-000 | $750.13     | $75.00      | 00005517 |
|  99 | 000097  | 2020     |         | 00031-18-004-000 | $750.13     | $75.00      | 00005517 |
| 100 | 000098  | 2020     |         | 00031-18-007-000 | $750.13     | $76.00      | 00005517 |
| 101 | 000099  | 2020     |         | 00031-18-008-000 | $905.98     | $84.00      | 00005517 |
| 102 | 000100  | 2020     |         | 00031-19-001-000 | $1,999.83   | $171.00     | 00005517 |
| 103 | 000101  | 2020     |         | 00031-19-004-000 | $1,486.49   | $131.00     | 00005517 |
| 104 | 000102  | 2020     |         | 00031-19-006-000 | $1,063.44   | $96.00      | 00005517 |
| 105 | 000103  | 2020     |         | 00031-20-001-000 | $1,468.47   | $126.00     | 00005517 |
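If you don't know the page count in advance, a variation is to keep posting until the server stops returning new rows. This is an untested sketch: the form fields and the #searchResults selector are copied from the code above, and the stop conditions assume the server either returns an empty table or repeats the last page when pageNum runs past the end.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

all_data = []
page = 1
seen_first = None
while True:
    data["pageNum"] = page
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    rows = soup.select("#searchResults tr")[2:]  # skip the two header rows
    if not rows:
        break  # empty table: we are past the last page
    first = [td.text.strip() for td in rows[0].select("td")]
    if first == seen_first:
        break  # server repeated the last page; stop
    seen_first = first
    for row in rows:
        all_data.append([td.text.strip() for td in row.select("td")])
    page += 1

columns = ["SEQ NUM", "Tax Year", "Notices", "Parcel ID",
           "Face Amount", "Winning Bid", "Sold To"]
df = pd.DataFrame(all_data, columns=columns)
df.to_csv("auction_results.csv", index=False)  # save everything for later use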
CodePudding user response:
Use the information wisely, and make sure you have permission to scrape this site and process the data.
If you press F12 on the site, go to the Network tab -> Payload, and switch to page two, form data shows up that includes the page number. Replicating this form and modifying the page value should let you scrape each page.
As always, there is probably a Python package out there that will make this easy.
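For example, here is a minimal sketch using requests (which is probably that package) combined with the pandas.read_html call from the question. The field names come from the DevTools Payload tab as described above; I've kept only a few of them for illustration, so the full set shown in the first answer may be required for the request to succeed.
import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
# Subset of the form data captured in DevTools; other fields may be needed.
form = {"folder": "auctionResults", "pageNum": "1", "doSearch": "true"}

pages = []
for page in range(1, 4):
    form["pageNum"] = str(page)  # only the page number changes between requests
    html = requests.post(url, data=form).content
    # same parsing as in the question, applied to each page's response
    pages.append(pd.read_html(html, header=1)[0].drop([0, 1, 2]))

df = pd.concat(pages, ignore_index=True)
print(len(df))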