How do I web scrape this link and iterate through the page numbers?


My goal is to scrape this URL and iterate through its pages, but I keep getting a strange error. My code and the error follow:

import requests
import json
import pandas as pd

url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}


#create a url list to scrape data from all pages
url_list = []

for i in range(0, 4375):
  url_list.append(url + str(i))

response = requests.get(url, headers=headers)
data = response.json()

d = json.dumps(data)
df = pd.json_normalize(d)

Error:

{'items': [{'applicationName': 'ReverseProxy', 'errorCode': 'UNAUTHORIZED', 'message': 'You are Unauthorized to perform the attempted operation. Application access token required', 'additionalErrorData': [{'name': 'OperationName', 'value': 'http://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page=0&page=1'}]}], 'exceptionDetail': {'type': 'Mozu.Core.Exceptions.VaeUnAuthorizedException'}}

This is strange to me, because I should be able to access each page at this URL; specifically, I can follow the link in a browser and copy and paste the JSON data directly. Is there a way to scrape this site without an API key?

CodePudding user response:

It works in your browser because you have the token cookie saved in your local storage. Once you delete all cookies, navigating to the API link directly no longer works.
The token cookie is sb-sf-at-prod. Add this cookie to your headers and the request will work.
I do not know whether the value of this cookie is tied to my IP address; if it is and my value does not work for you, just replace it with the value from your own browser.
This cookie may only be valid for a limited number of requests or for a limited time, so I recommend sleeping between requests. Note that this website is protected by Akamai's anti-bot security.

import requests
import json

url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    # Token cookie copied from a browser session; replace with your own value
    'cookie': 'sb-sf-at-prod=at=/VzynTSsuVJGJMAd8+jAO67EUtyn1fIEaqKmCi923rynHnztv6rQZH/5LMa7pmMBRiW00x2L+/LfmJhJKLpNMoK9OFJi069WHbzphl+ZFM/pBV+dqmhCL/tylU11GQYQ8y7qavW4MWS4xJzWdmKV/01iJ0RkwynJLgcXmCzcde2oqgxa/AYWa0hN0xuYBMFlCoHJab1z3CU/01FJlsBDzXmJwb63zAJGVj4PIH5LvlcbnbOhbouQBKxCrMyrmpvxDf70U3nTl9qxF9qgOyTBZnvMBk1juoK8wL1K3rYp51nBC0O+thd94wzQ9Vkolk+4y8qapFaaxRtfZiBqhAAtMg=='
}

# Build the list of page URLs to scrape
url_list = []
for i in range(0, 4375):
    url_list.append(url + str(i))

# Fetch the first page to verify that the cookie works
response = requests.get(url_list[0], headers=headers)
data = response.json()
d = json.dumps(data)
print(d)
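
To actually iterate through the pages, here is a minimal sketch of how I would put it together. It assumes the paginated response lists its records under an 'items' key (mirroring the shape of the error payload above; adjust to the real response shape if it differs), and the cookie value is a placeholder you must replace with the one from your own browser:

import time

import pandas as pd
import requests

url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    # Placeholder: paste the sb-sf-at-prod cookie value from your own browser
    'cookie': 'sb-sf-at-prod=at=<your cookie value here>',
}

records = []
for page in range(0, 4375):
    response = requests.get(url + str(page), headers=headers)
    response.raise_for_status()
    payload = response.json()

    # Assumption: successful pages expose their records under 'items';
    # stop once a page comes back empty, since 4375 is only an upper bound
    items = payload.get('items', [])
    if not items:
        break

    records.extend(items)
    time.sleep(1)  # be polite; the site sits behind Akamai anti-bot protection

df = pd.json_normalize(records)
print(df.shape)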

I hope I was able to help you.
