Currently, I have successfully used Python to scrape store information from a competitor's website. The website has a map where you can enter a zip code and it lists all the stores in the area around that location. The website sends a GET request to pull store data using this link:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30
My goal is to scrape all store information, not just the results for a single zip code (e.g. 12345) with pagesize=30. How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I've tried raising pagesize past 30, but that looks like the limit on the request.
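For reference, here is a rough sketch of the single request I'm currently making (the parameters come straight from the URL above; the top-level 'stores' key is what I see when I inspect the JSON response in the browser):

import requests

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'
params = {'address': '37028', 'radius': 50, 'pagesize': 30}
headers = {'User-Agent': 'Mozilla/5.0'}  # sent so the request looks like it comes from a browser

resp = requests.get(url, params=params, headers=headers)
stores = resp.json().get('stores', [])
print(len(stores))  # capped at 30 per request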
CodePudding user response:
This URL returns JSON containing "currentPage": 1, which suggests it supports some kind of pagination. I added &page=2 (and &page=3, and so on) to the URL and it seems to work.
For testing I used a bigger radius=250, which returns JSON with "recordCount": 123.
I also found that it works with pagesize=40; for bigger values it sends JSON with an error message.
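Since the JSON reports recordCount and storesPerPage, you could also compute the number of pages up front instead of paging until the results run out. A short sketch (assuming recordCount is always present in the first response):

import math

record_count = 123                                 # "recordCount" from the first response
pagesize = 40
total_pages = math.ceil(record_count / pagesize)   # 4 pages for 123 records

for page in range(1, total_pages + 1):
    print('requesting page', page)                 # send the request with &page=<page> here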
EDIT:
Minimal working code:
The page blocks requests that don't send a User-Agent header, so one is included below.
import requests

# the server rejects requests without a User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

payload = {
    'address': 37028,
    'radius': 250,
    'pagesize': 40,
    'page': 1,
}

page = 0

while True:
    page += 1
    print('--- page:', page, '---')

    payload['page'] = page
    response = requests.get(url, params=payload, headers=headers)
    data = response.json()

    # stop when the response no longer contains stores
    if "stores" not in data:
        break

    print(data['searchReport'])

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
Result:
--- page: 1 ---
{'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
1 | phone: (931)906-2655 | zip: 37040
2 | phone: (270)442-0817 | zip: 42001
3 | phone: (615)662-7600 | zip: 37221
4 | phone: (615)865-9600 | zip: 37115
5 | phone: (615)228-3317 | zip: 37216
6 | phone: (615)269-7800 | zip: 37204
7 | phone: (615)824-2391 | zip: 37075
8 | phone: (615)370-0730 | zip: 37027
9 | phone: (615)889-7211 | zip: 37076
10 | phone: (615)599-4578 | zip: 37064
etc.
--- page: 2 ---
{'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
1 | phone: (662)890-9470 | zip: 38654
2 | phone: (502)964-1845 | zip: 40219
3 | phone: (812)941-9641 | zip: 47150
4 | phone: (812)282-0470 | zip: 47129
5 | phone: (662)349-6080 | zip: 38637
6 | phone: (502)899-3706 | zip: 40207
7 | phone: (662)840-8390 | zip: 38866
8 | phone: (502)491-3682 | zip: 40220
9 | phone: (870)268-0619 | zip: 72404
10 | phone: (256)575-2100 | zip: 35768
etc.
If you want to keep the results as a DataFrame, then first put all items into a list and later convert that list to a DataFrame:
# --- before loop ----
all_items = []
page = 0

# --- loop ----
while True:
    # ... code ...

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
        all_items.append(item)

# --- after loop ----
import pandas as pd

df = pd.DataFrame(all_items)
print(df)
Because the JSON keeps address as a dictionary ({'postalCode': ..., ...}), some columns will hold dictionaries:
print(df.iloc[0])
storeId 0726
name Clarksville, TN
phone (931)906-2655
address {'postalCode': '37040', 'county': 'Montgomery'...
coordinates {'lat': 36.581677, 'lng': -87.300826}
services {'loadNGo': True, 'propane': True, 'toolRental...
storeContacts [{'name': 'Brenda G.', 'role': 'Manager'}]
storeHours {'monday': {'open': '6:00', 'close': '21:00'},...
url /l/Clarksville-TN/TN/Clarksville/37040/726
distance 32.530296
proDeskPhone (931)920-9400
flags {'bopisFlag': True, 'assemblyFlag': True, 'bos...
marketNbr 0019
axGeoCode 00
storeTimeZone CST6CDT
curbsidePickupHours {'monday': {'open': '09:00', 'close': '18:00'}...
storeOpenDt 1998-08-13
storeType retail
toolRentalPhone NaN
Note the { } in address, services, storeHours, etc. You may also want to expand these dictionaries into separate columns.
df['address'].apply(pd.Series)
expands the dictionary into separate columns, which you can concat with the original df:
df2 = pd.concat([df, df['address'].apply(pd.Series)], axis=1)
You can do the same with the other dictionary columns.
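For example, a short sketch (the column names are taken from the df.iloc[0] output above) that expands several dictionary columns at once, prefixes the new columns with the original column name, and drops the originals:

# expand each dictionary column into its own set of prefixed columns
dict_cols = ['address', 'coordinates', 'services', 'flags']

expanded = [df[col].apply(pd.Series).add_prefix(f'{col}.') for col in dict_cols]
df2 = pd.concat([df.drop(columns=dict_cols)] + expanded, axis=1)

print(df2.columns.tolist())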
CodePudding user response:
I had the same issue before, and you already stated one of the solutions. I recommend searching domain/sitemap.xml and domain/robots.txt for the available stores.
Also, sometimes the data is loaded by .js requests, so open the network tab and search for one of the stores' IDs.
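As a rough sketch of the sitemap idea (it assumes a sitemap exists at /sitemap.xml and that store pages contain /l/ in their path, matching the url field shown above; the real filename and URL pattern may differ):

import requests
import xml.etree.ElementTree as ET

headers = {'User-Agent': 'Mozilla/5.0'}

# hypothetical sitemap location - check robots.txt for the real one
resp = requests.get('https://www.homedepot.com/sitemap.xml', headers=headers)
root = ET.fromstring(resp.content)

# sitemap files use this namespace for <loc> elements
ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
urls = [loc.text for loc in root.iter(ns + 'loc')]

# keep only URLs that look like store pages, e.g. /l/Clarksville-TN/TN/Clarksville/37040/726
store_urls = [u for u in urls if '/l/' in u]
print(len(store_urls), store_urls[:5])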