BeautifulSoup Returns vs View-Source from Chrome (Zillow)

Time:11-27

I have been trying to scrape a listing page from Zillow, but BeautifulSoup returns much less HTML than Chrome's view-source shows. Here is my code:

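from bs4 import BeautifulSoup
import requests

url = 'https://www.zillow.com/homedetails/49-Mountain-St-Hartford-CT-06106/58139903_zpid/'
response = requests.get(url)  # plain GET with the default requests headers
bs = BeautifulSoup(response.text, "html.parser")
bs  # in a notebook or REPL, this displays the parsed document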

The result contains very little inside the body. However, if you open the same URL in Chrome and view the page source, you see a lot more. Could someone show how to scrape the full contents of the body on Zillow? I also saw "Please verify you're a human to continue" in the result. How do I handle that?

CodePudding user response:

I think your basic problem is that Zillow loads a lot of additional data after the first page request and uses that data to populate the page. requests only fetches that first response and does not run the JavaScript that makes the later requests. Zillow may also do things to discourage web scraping (such as the captcha you're seeing).
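To tell which of the two cases you are hitting, you can check the response text for the bot-check message before parsing it. This is only a minimal sketch, assuming the block page contains the phrase quoted in the question. `looks_blocked` and `fetch` are hypothetical helper names, and a different block page may be worded differently:

```python
# Phrase taken from the question; this is an assumption about the block page,
# not a documented Zillow marker.
BLOCK_PHRASE = "verify you're a human"


def looks_blocked(html: str) -> bool:
    """Return True if the HTML looks like the bot-check page from the question."""
    # Normalise case and curly apostrophes before searching.
    normalised = html.lower().replace("\u2019", "'")
    return BLOCK_PHRASE in normalised


def fetch(url: str) -> str:
    """Download a page with requests and return its text (hypothetical helper)."""
    import requests  # third-party; imported here so looks_blocked works without it

    response = requests.get(url, timeout=10)
    return response.text
```

If `looks_blocked(fetch(url))` is true, the short body is the captcha page itself, and there is no listing data in it for BeautifulSoup to find.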

Doing this well is a huge topic, and not one that is easily covered in a single Stack Overflow answer. This page lists resources that may be helpful to you as a scraper: https://github.com/niespodd/browser-fingerprinting

You can also open the Network tab in your browser's developer tools (F12 or Ctrl+Shift+I in Chrome on Windows and Linux). In the Network tab you can see the outgoing requests and their responses. Find the data you want in the responses, then study the corresponding requests to work out how to fetch that data yourself.
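Once you have spotted a request in the Network tab whose response is JSON, you can replay it from Python and search the decoded payload. The sketch below is hedged: `fetch_json` and `find_key` are hypothetical helpers, the URL and headers have to be copied from your own Network tab, and the sample payload is made up for illustration and is not Zillow's real schema:

```python
import json
from typing import Any, Iterator


def find_key(data: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a decoded JSON document."""
    if isinstance(data, dict):
        for name, value in data.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


def fetch_json(api_url: str, headers: dict) -> Any:
    """Replay a request copied from the Network tab and decode its JSON body."""
    import requests  # third-party; imported here so find_key works without it

    response = requests.get(api_url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()


# Made-up payload standing in for a response copied from the Network tab.
sample = json.loads('{"property": {"price": 100000, "history": [{"price": 90000}]}}')
prices = list(find_key(sample, "price"))
```

Searching by key name like this is handy while you are still exploring an unfamiliar response. Once you know the exact path to the value, indexing into the dictionary directly is clearer.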

CodePudding user response:

As for the "verify you are a human" message: a good captcha today does not expose its answer to client-side parsing, and it defeats most attempts to get past it by modifying the request headers. So you may want to drive a real browser with Selenium and a WebDriver instead of using only the requests library. That way you can solve the captcha manually and then let your scraper do its work.
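A rough sketch of that workflow follows. It assumes the selenium and bs4 packages are installed and that a Chrome driver is available on your machine. `page_source_after_manual_check` and `main` are hypothetical names, and the script simply pauses until you say the captcha is solved:

```python
def page_source_after_manual_check(driver, url, wait_for_user=input):
    """Open `url`, pause so a human can solve any captcha, then return the HTML."""
    driver.get(url)
    wait_for_user("Solve the captcha in the browser window, then press Enter... ")
    # page_source reflects the DOM after the browser has run the page's scripts.
    return driver.page_source


def main():
    # Third-party imports live here so the helper above has no hard dependencies.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    url = "https://www.zillow.com/homedetails/49-Mountain-St-Hartford-CT-06106/58139903_zpid/"
    driver = webdriver.Chrome()  # assumes a working Chrome and driver setup
    try:
        html = page_source_after_manual_check(driver, url)
    finally:
        driver.quit()
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title)
```

Call `main()` to run it. Passing the driver in as an argument keeps the helper independent of the browser you choose, so the same function works with Firefox or a remote WebDriver.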
