Scraping a webpage with Python but unsure how to deal with a static(?) URL


I am trying to learn how to pull data from this URL: https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview

However, the problem is that the URL doesn't change when I switch pages, so I am not sure how to enumerate or loop through them. I'm looking for a better approach, since the site lists about 3,000 sale records.

Here is my starting code. It is very simple, but I would appreciate any help or hints. I think I might need to switch to another package, but I am not sure which one; maybe BeautifulSoup?

import requests
import pandas as pd  # needed for read_html below

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]
df_list = df_list.drop([0, 1, 2])  # drop unnecessary header rows

CodePudding user response:

To get the data from more pages, you can use this example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


# form data the page submits when switching pages
# (visible in the browser dev tools under Network -> Payload)
data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"


all_data = []

for page_num in range(1, 3):  # <-- increase the number of pages here
    data["pageNum"] = page_num
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for row in soup.select("#searchResults tr")[2:]:  # skip the two header rows
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)

columns = [
    "SEQ NUM",
    "Tax Year",
    "Notices",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)

# print last 10 items from dataframe:
print(df.tail(10).to_markdown())

Prints:

|     | SEQ NUM | Tax Year | Notices | Parcel ID        | Face Amount | Winning Bid | Sold To  |
|----:|:--------|---------:|:--------|:-----------------|:------------|:------------|:---------|
|  96 | 000094  |     2020 |         | 00031-18-001-000 | $905.98     | $81.00      | 00005517 |
|  97 | 000095  |     2020 |         | 00031-18-002-000 | $750.13     | $75.00      | 00005517 |
|  98 | 000096  |     2020 |         | 00031-18-003-000 | $750.13     | $75.00      | 00005517 |
|  99 | 000097  |     2020 |         | 00031-18-004-000 | $750.13     | $75.00      | 00005517 |
| 100 | 000098  |     2020 |         | 00031-18-007-000 | $750.13     | $76.00      | 00005517 |
| 101 | 000099  |     2020 |         | 00031-18-008-000 | $905.98     | $84.00      | 00005517 |
| 102 | 000100  |     2020 |         | 00031-19-001-000 | $1,999.83   | $171.00     | 00005517 |
| 103 | 000101  |     2020 |         | 00031-19-004-000 | $1,486.49   | $131.00     | 00005517 |
| 104 | 000102  |     2020 |         | 00031-19-006-000 | $1,063.44   | $96.00      | 00005517 |
| 105 | 000103  |     2020 |         | 00031-20-001-000 | $1,468.47   | $126.00     | 00005517 |
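
If you don't know how many pages there are in advance, one option is to keep requesting pages until one comes back empty. Here is a minimal sketch of that idea, reusing the url, data, and columns variables defined in the example above. Note the stopping condition (an out-of-range page returning no result rows) is an assumption about the site's behavior, so verify it before relying on it; the auction_results.csv filename is just an example.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
# `data` and `columns` are the same form-data dict and column list shown above

all_data = []
page_num = 1
while True:
    data["pageNum"] = page_num
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    rows = soup.select("#searchResults tr")[2:]  # skip the two header rows
    if not rows:
        # assumption: a page number past the last page returns no result rows
        break
    for row in rows:
        all_data.append([td.text.strip() for td in row.select("td")])
    page_num += 1

df = pd.DataFrame(all_data, columns=columns)
df.to_csv("auction_results.csv", index=False)  # save all records to disk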

CodePudding user response:

Use the information wisely, and ensure you have permission to scrape this site and process the data.

It looks like if you press F12 on the site, open the Network tab, switch to page two, and inspect the request's Payload, the form data shows up with the page number. Replicating this form and modifying the page value should allow you to scrape the site.

As always, there is probably a Python package out there that will make this easy. A sketch of the idea follows below.
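
For example, here is a minimal sketch that combines this form-data approach with the pd.read_html call from the question, so no manual row parsing is needed. The trimmed-down payload (only folder, pageNum, and doSearch) is an assumption on my part; if it doesn't return results, copy the full set of fields shown in the dev tools. The header/row offsets are taken from the question's code.

import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

frames = []
for page_num in range(1, 3):  # increase to cover more pages
    # trimmed-down payload -- an assumption; copy all fields from dev tools if needed
    data = {"folder": "auctionResults", "pageNum": str(page_num), "doSearch": "true"}
    html = requests.post(url, data=data).text
    # same header/drop offsets as the question's code
    frames.append(pd.read_html(html, header=1)[0].drop([0, 1, 2]))

df = pd.concat(frames, ignore_index=True)
print(df.shape)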
