Scraping Real Estate Website using Python


I am trying to scrape the MLS Number, Price, and Address of real estate listings from a website using BeautifulSoup.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# string url
str_url = 'https://www.utahrealestate.com/search/map.search'

# get response
response = requests.get(str_url)

# get html
soup = BeautifulSoup(response.text, 'html.parser')

# get the number of listings and assign it to int_n_pages
# (I can't get this to work; it returns None)
int_n_pages = soup.find('li', {'class': 'view-results'})

# split the element's text and take the listing count
# (this fails because the previous line returns None)
int_n_pages = int(int_n_pages.text.split(' ')[2])
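
A quick check like the one below (a minimal sketch, hitting the same URL) should show whether the element is present in the static HTML at all, or whether it is injected by JavaScript after the page loads, which would explain the `None`:

import requests
from bs4 import BeautifulSoup

str_url = 'https://www.utahrealestate.com/search/map.search'
response = requests.get(str_url)

# if both of these print False, the listing count is not in the static HTML,
# so it is most likely rendered client-side by JavaScript
print('view-results' in response.text)

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('li', {'class': 'view-results'}) is not None)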

Next, my plan is to iterate through all pages and extract the information from each listing.

Something like...

# empty list
list_dict_cards = []

# iterate through pages
for int_page in range(1, int_n_pages + 1):

    # get url
    str_url = f'https://www.utahrealestate.com/search/map.search/page/{int_page}/vtype/map'

    # get response
    response = requests.get(str_url)

    # get html
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # get property cards
    property_cards = soup.find_all(class_='property___card')

    # iterate through property cards
    for card in property_cards:

        # empty dict
        dict_card = {}

        # get mls number
        int_mls = card.find(class_='mls___number').text.split(' ')[1]

        # put into dict_card
        dict_card['mls'] = int_mls

        # I would get other info here as well and put into dict_card

        # append dict_card to list_cards
        list_dict_cards.append(dict_card)

# make df
df_cards = pd.DataFrame(list_dict_cards)

# save
df_cards.to_csv('./output/df_dict_cards.csv', index=False)

I am pretty sure the site is attempting to prevent programmatic access to much of the info it displays.

Is there a way around this, and if so, how?

CodePudding user response:

There is an endpoint that looks like it can be scraped effectively if you make a POST request to it with the right headers after you've visited the home page (probably so your session holds the right cookies). The example below seems to do the trick. Note that it is the site that is very slow, not the script.

import requests

s = requests.Session()

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}

# visit the home page first so the session picks up the cookies it needs
home = 'https://www.utahrealestate.com/search/map.search'
step = s.get(home, headers=headers)

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.utahrealestate.com',
    'Origin': 'https://www.utahrealestate.com',
    'Referer': 'https://www.utahrealestate.com/search/map.search',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# POST to the inline results endpoint page by page and parse the JSON response
for page in range(1, 5):
    url = f'https://www.utahrealestate.com/search/map.inline.results/pg/{page}/sort/entry_date_desc/paging/0/dh/862'
    data = s.post(url, headers=headers).json()
    results = len(data['listing_data'])

    print(f'Scraped {results} results from page {page}')
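
From there you could build the DataFrame the question was after. This is only a sketch, continuing the session above: I'm assuming each entry in `listing_data` is an HTML fragment for one property card, and I'm reusing the `mls___number` selector from the question. Inspect the actual payload and adjust the keys and selectors to match what the endpoint really returns.

import pandas as pd
from bs4 import BeautifulSoup

list_dict_cards = []

for page in range(1, 5):
    url = f'https://www.utahrealestate.com/search/map.inline.results/pg/{page}/sort/entry_date_desc/paging/0/dh/862'
    data = s.post(url, headers=headers).json()

    # assumption: each entry in listing_data is an HTML fragment for one card
    for fragment in data['listing_data']:
        card = BeautifulSoup(fragment, 'html.parser')
        dict_card = {}

        # selector taken from the question's markup; may differ in this payload
        mls_el = card.find(class_='mls___number')
        parts = mls_el.text.split() if mls_el else []
        dict_card['mls'] = parts[1] if len(parts) > 1 else None

        # price, address, etc. would be extracted the same way here

        list_dict_cards.append(dict_card)

df_cards = pd.DataFrame(list_dict_cards)
df_cards.to_csv('./output/df_dict_cards.csv', index=False)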