What am I doing wrong in scraping? My code returns no value

Time:11-16

My code works for one site but not for another. Can someone help me out?

 import requests
 from bs4 import BeautifulSoup
 URL = "https://www.homedepot.com/s/311256393"
 page = requests.get(URL)
 soup = BeautifulSoup(page.content, "html.parser")
 results = soup.find(id="root")
 print(results.prettify())

Whereas the code below shows output. Is there any difference between the websites?

 import requests
 from bs4 import BeautifulSoup
 URL = "https://realpython.github.io/fake-jobs/"
 page = requests.get(URL)
 soup = BeautifulSoup(page.content, "html.parser")
 results = soup.find(id="ResultsContainer")
 print(results.prettify())

CodePudding user response:

When scraping The Home Depot, you need to use a proxy if your IP address is outside the US; otherwise the site returns an Access Denied error. Parse the data from their GraphQL API rather than from the page HTML: open Dev Tools -> Network -> Fetch/XHR, find the request with the appropriate name, click it, open the Headers tab that appears on the right, copy the request URL, and make a request to that URL.

Then read the JSON response content with the requests library: requests.get("URL").json() decodes the JSON string into a Python dictionary.
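The two steps above can be sketched roughly as follows. This is only a minimal sketch: fetch_json is a hypothetical helper (not part of requests), the session argument is assumed to be a requests.Session, and the proxy address in the comment is a placeholder. The actual endpoint URL and any query parameters have to be copied from Dev Tools, so none are filled in here.

```python
def fetch_json(session, url, params=None, proxies=None, timeout=10):
    """Request `url` and return the decoded JSON body as Python objects.

    Hypothetical helper: `session` is assumed to behave like requests.Session.
    """
    response = session.get(
        url,
        params=params,
        # Placeholder format, e.g. {"https": "http://user:password@host:port"};
        # needed only when the site blocks your IP address.
        proxies=proxies,
        timeout=timeout,
    )
    # Raise on HTTP errors (such as 403 Access Denied) instead of
    # trying to decode an error page as JSON.
    response.raise_for_status()
    # Decode the JSON string into a Python dictionary (or list).
    return response.json()
```

With requests installed, it could be called as fetch_json(requests.Session(), "<URL copied from Dev Tools>"). Passing a session in keeps headers and proxies in one place if you make several requests.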


Alternatively, if you don't want to deal with bypassing blocks, you can get the desired output by using The Home Depot Search Engine Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to deal with the blocks mentioned above or figure out how to scale the number of requests (if needed), and there's nothing to maintain over time (for example, when something in the HTML changes). Check out the playground with the product you were looking for (requires login).

Example code to integrate (also available in the online IDE):

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"), 
  "engine": "home_depot_product",  #                                ↓↓↓
  "product_id": "311256393"        # https://www.homedepot.com/s/311256393 ←
                                   #                                ↑↑↑
}                          

search = GoogleSearch(params)
results = search.get_dict()

title = results["product_results"]["title"]
link = results["product_results"]["link"]
price = results["product_results"]["price"]
rating = results["product_results"]["rating"]

print(title, link, price, rating, sep="\n")


# actual JSON response is much bigger
'''
20 in. x 20 in. Palace Tile Outdoor Throw Pillow with Fringe
https://www.homedepot.com/p/Hampton-Bay-20-in-x-20-in-Palace-Tile-Outdoor-Throw-Pillow-with-Fringe-7747-04413111/311256393
19.98
5.0
'''

A quick glance at available product_results:

for key in results["product_results"]:
    print(key)  # one key per line


'''
product_id
title
description
link
upc
model_number
favorite
rating
reviews
price
highlights
brand
images
bullets
specifications
fulfillment
'''
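Not every product necessarily returns every key, so a defensive way to read the fields might look like the sketch below. pick_fields is a hypothetical helper, not part of the serpapi package, and the check for an "error" key is an assumption about how a failed search is reported. The sample dictionary is a trimmed stand-in built from the output shown earlier.

```python
def pick_fields(results, fields=("title", "link", "price", "rating")):
    """Return the chosen product fields, with None for any that are missing."""
    # Assumption: a failed search carries an "error" key
    # instead of "product_results".
    if "error" in results:
        raise RuntimeError(results["error"])
    product = results.get("product_results", {})
    # dict.get avoids a KeyError when a field is absent.
    return {name: product.get(name) for name in fields}


# Trimmed stand-in for search.get_dict(); "link" is left out on purpose.
sample = {
    "product_results": {
        "title": "20 in. x 20 in. Palace Tile Outdoor Throw Pillow with Fringe",
        "price": 19.98,
        "rating": 5.0,
    }
}

for name, value in pick_fields(sample).items():
    print(name, value)
```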

Disclaimer, I work for SerpApi.


P.S. I have a dedicated web scraping blog.
