The code below extracts data from a Zillow sale listings page.
My first question: where do people get the headers information?
My second question: how do I know when I need headers? For some other pages, like Cars.com, I don't need to pass headers=headers and I can still get the data correctly.
Thank you for your help. HHC
import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/baltimore-md-21201/?searchQueryState={"pagination":{},"usersSearchTerm":"21201","mapBounds":{"west":-76.67377295275878,"east":-76.5733510472412,"south":39.26716345016057,"north":39.32309233550334},"regionSelection":[{"regionId":66811,"regionType":7}],"isMapVisible":true,"filterState":{"ah":{"value":true}},"isListVisible":true,"mapZoom":14}'

# Browser-like headers so the request is not rejected as a bot
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState={"pagination'
}

raw_page = requests.get(url, headers=headers)
print(raw_page.status_code)

# Load the page content into Beautiful Soup
page_soup = BeautifulSoup(raw_page.content, 'html.parser')
print(page_soup)
CodePudding user response:
You can get the headers by visiting the site in your browser and opening the Network tab of the developer tools; select a request there and you can see the headers your browser sent with it. Copy the ones you need into your script.
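For example, here is a minimal sketch. The header names and values below are just what one browser might send, copied from a single request in the Network tab; they are placeholders, not values Zillow specifically requires:

import requests

# Hypothetical headers copied from one request in the browser's Network
# tab; your own browser will show different values.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
}

response = requests.get('https://www.zillow.com/baltimore-md-21201/', headers=headers)
print(response.status_code)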
Some websites don't serve bots, so to make them think you're not a bot you set the user-agent header to one a browser uses. Some sites may require more headers before they treat you as a real browser. You can see all the headers being sent in the developer tools and test different combinations until your request succeeds, as in the sketch below.
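One way to find out which headers a site actually checks is to start with a bare request and add headers one set at a time, watching the status code. A minimal sketch, where the URL and header values are assumptions for the test, not the exact ones Zillow requires:

import requests

url = 'https://www.zillow.com/baltimore-md-21201/'

# Candidate header sets, from a bare request to progressively more
# browser-like ones.
candidates = [
    {},  # no headers: requests' default user agent
    {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
    {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
     'accept-language': 'en-US,en;q=0.9'},
]

for headers in candidates:
    response = requests.get(url, headers=headers)
    # 200 usually means the page was served; 403 or a captcha page
    # usually means the site flagged the request as a bot.
    print(sorted(headers), '->', response.status_code)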