Currently, I have successfully used Python to scrape store information from a competitor's website. The website has a map where you can enter a zip code and it lists all the stores in the area around that location. The website sends a GET request to pull store data using this link:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30
My goal is to scrape all store information, not just the results for a single zip code (e.g. 12345) with pagesize=30. How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I've tried raising pagesize past 30, but that looks like the limit on the request.
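For reference, here is a rough sketch of the single request I'm currently making (the parameters come straight from the URL above; the top-level 'stores' key is what I see when I inspect the JSON response in the browser):

import requests

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'
params = {'address': '37028', 'radius': 50, 'pagesize': 30}
headers = {'User-Agent': 'Mozilla/5.0'}  # sent so the request looks like it comes from a browser

resp = requests.get(url, params=params, headers=headers)
stores = resp.json().get('stores', [])
print(len(stores))  # capped at 30 per request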
CodePudding user response:
This URL returns JSON containing "currentPage": 1, which suggests it supports some kind of pagination. I added &page=2 (and &page=3, and so on) to the URL and it seems to work.
For testing I used a bigger radius=250, which returns JSON with "recordCount": 123.
I also found that it works with pagesize=40; for bigger values it sends JSON with an error message.
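Since the JSON reports recordCount and storesPerPage, you could also compute the number of pages up front instead of paging until the results run out. A short sketch (assuming recordCount is always present in the first response):

import math

record_count = 123                                 # "recordCount" from the first response
pagesize = 40
total_pages = math.ceil(record_count / pagesize)   # 4 pages for 123 records

for page in range(1, total_pages + 1):
    print('requesting page', page)                 # send the request with &page=<page> here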
EDIT:
Minimal working code:
The page blocks requests that don't send a User-Agent header, so one is included below.
import requests

# the server rejects requests without a User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

payload = {
    'address': 37028,
    'radius': 250,
    'pagesize': 40,
    'page': 1,
}

page = 0

while True:
    page += 1
    print('--- page:', page, '---')

    payload['page'] = page
    response = requests.get(url, params=payload, headers=headers)
    data = response.json()

    # stop when the response no longer contains stores
    if "stores" not in data:
        break

    print(data['searchReport'])

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
Result:
--- page: 1 ---
{'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
1 | phone: (931)906-2655 | zip: 37040
2 | phone: (270)442-0817 | zip: 42001
3 | phone: (615)662-7600 | zip: 37221
4 | phone: (615)865-9600 | zip: 37115
5 | phone: (615)228-3317 | zip: 37216
6 | phone: (615)269-7800 | zip: 37204
7 | phone: (615)824-2391 | zip: 37075
8 | phone: (615)370-0730 | zip: 37027
9 | phone: (615)889-7211 | zip: 37076
10 | phone: (615)599-4578 | zip: 37064
etc.
--- page: 2 ---
{'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
1 | phone: (662)890-9470 | zip: 38654
2 | phone: (502)964-1845 | zip: 40219
3 | phone: (812)941-9641 | zip: 47150
4 | phone: (812)282-0470 | zip: 47129
5 | phone: (662)349-6080 | zip: 38637
6 | phone: (502)899-3706 | zip: 40207
7 | phone: (662)840-8390 | zip: 38866
8 | phone: (502)491-3682 | zip: 40220
9 | phone: (870)268-0619 | zip: 72404
10 | phone: (256)575-2100 | zip: 35768
etc.
If you want to keep the results as a DataFrame, then first put all items into a list and later convert that list to a DataFrame:
# --- before loop ----
all_items = []
page = 0

# --- loop ----
while True:
    # ... code ...

    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
        all_items.append(item)

# --- after loop ----
import pandas as pd

df = pd.DataFrame(all_items)
print(df)
Because the JSON keeps address as a dictionary ({'postalCode': ..., ...}), some columns will hold dictionaries:
print(df.iloc[0])
storeId 0726
name Clarksville, TN
phone (931)906-2655
address {'postalCode': '37040', 'county': 'Montgomery'...
coordinates {'lat': 36.581677, 'lng': -87.300826}
services {'loadNGo': True, 'propane': True, 'toolRental...
storeContacts [{'name': 'Brenda G.', 'role': 'Manager'}]
storeHours {'monday': {'open': '6:00', 'close': '21:00'},...
url /l/Clarksville-TN/TN/Clarksville/37040/726
distance 32.530296
proDeskPhone (931)920-9400
flags {'bopisFlag': True, 'assemblyFlag': True, 'bos...
marketNbr 0019
axGeoCode 00
storeTimeZone CST6CDT
curbsidePickupHours {'monday': {'open': '09:00', 'close': '18:00'}...
storeOpenDt 1998-08-13
storeType retail
toolRentalPhone NaN
Note the { } in address, services, storeHours, etc. You may also want to expand these dictionaries into separate columns.
df['address'].apply(pd.Series)
expands the dictionary into separate columns, which you can concat with the original df:
df2 = pd.concat([df, df['address'].apply(pd.Series)], axis=1)
You can do the same with the other dictionary columns.
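For example, a short sketch (the column names are taken from the df.iloc[0] output above) that expands several dictionary columns at once, prefixes the new columns with the original column name, and drops the originals:

# expand each dictionary column into its own set of prefixed columns
dict_cols = ['address', 'coordinates', 'services', 'flags']

expanded = [df[col].apply(pd.Series).add_prefix(f'{col}.') for col in dict_cols]
df2 = pd.concat([df.drop(columns=dict_cols)] + expanded, axis=1)

print(df2.columns.tolist())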
CodePudding user response:
I had the same issue before, and you already stated one of the solutions. I recommend searching domain/sitemap.xml and domain/robots.txt for the available stores.
Also, sometimes the data is loaded by .js requests, so open the network tab and search for one of the stores' IDs.
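As a rough sketch of the sitemap idea (it assumes a sitemap exists at /sitemap.xml and that store pages contain /l/ in their path, matching the url field shown above; the real filename and URL pattern may differ):

import requests
import xml.etree.ElementTree as ET

headers = {'User-Agent': 'Mozilla/5.0'}

# hypothetical sitemap location - check robots.txt for the real one
resp = requests.get('https://www.homedepot.com/sitemap.xml', headers=headers)
root = ET.fromstring(resp.content)

# sitemap files use this namespace for <loc> elements
ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
urls = [loc.text for loc in root.iter(ns + 'loc')]

# keep only URLs that look like store pages, e.g. /l/Clarksville-TN/TN/Clarksville/37040/726
store_urls = [u for u in urls if '/l/' in u]
print(len(store_urls), store_urls[:5])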