I have a list of brand numbers that go into a webpage URL. I build the URL as an f-string and substitute each brand number into it. Each page returns a unique ID (a cursor) that is used to load the next page. I'm trying to collect these next-page IDs while keeping track of which brand number each ID belongs to.
Here's some sample code:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

brands = [989, 1344, 474, 1237, 886, 1, 328, 2188]
testid = {}
payload = {}
headers = {}

# First pass: store the first cursor for every brand that has more pages.
for b in brands:
    url = f'https://webapi.depop.com/api/v2/search/products/?brands={b}&itemsPerPage=24&country=gb&currency=GBP&sort=relevance'
    response = requests.request("GET", url, headers=headers, data=payload)
    test = pd.read_json(StringIO(response.text), lines=True)
    for m in test['meta'].items():
        if m[1]['hasMore'] == True:
            testid[str(b)] = [m[1]['cursor']]
        else:
            continue

# Second pass: follow the latest cursor of each brand to collect the next ones.
for br in testid.keys():
    while True:
        html = f'https://webapi.depop.com/api/v2/search/products/?brands={br}&cursor={testid[str(br)][-1]}&itemsPerPage=24&country=gb&currency=GBP&sort=relevance'
        r = requests.request("GET", html, headers=headers, data=payload)
        read_id = pd.read_json(StringIO(r.text), lines=True)
        for m in read_id['meta'].items():
            try:
                testid[str(br)].append(m[1]['cursor'])
            except:
                continue
Here's the output it produces:
{'989': ['MnwyNHwxNjQwMDMwODcw']}
However, it overwrites the values originally stored under each brand number and leaves only the last one collected. It should instead keep a list per brand and produce something like this:
{'989': ['MnwyNHwxNjQwMDI4Mzk1', ...],
'1344': ['MnwyNHwxNjQwMDI4Mzk2', ...],
'474': ['MnwyNHwxNjQwMDI4Mzk3', ...],
'1237': ['MnwyNHwxNjQwMDI4Mzk3', ...],
'886': ['MnwyNHwxNjQwMDI4Mzk4', ...],
'1': ['MnwyNHwxNjQwMDI4Mzk4', ...],
'328': ['MnwyNHwxNjQwMDI4Mzk5', ...],
where the triple dots (...) denote the additional ID values collected from the pages for that brand number. How can I get an output like this?
CodePudding user response:
After making testid a collections.defaultdict(list) instead of a plain dict, the rest falls out in a rather straightforward manner.
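For anyone who has not used it, here is a minimal sketch of what defaultdict(list) buys us. The cursor strings are made up for illustration and are not real API values. A missing key is created with an empty list on first access, so we can append without first checking whether the brand is already present:

```python
import collections

# Hypothetical (brand, cursor) pairs, for illustration only.
pages = [("989", "cursor-a"), ("1344", "cursor-b"), ("989", "cursor-c")]

cursors_by_brand = collections.defaultdict(list)
for brand, cursor in pages:
    # No KeyError the first time a brand is seen: an empty list is created for us.
    cursors_by_brand[brand].append(cursor)

print(dict(cursors_by_brand))
# {'989': ['cursor-a', 'cursor-c'], '1344': ['cursor-b']}
```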
Note: I'm only going to fetch the first 3 cursors for any brand, but you can fetch them all if you like.
import collections
import requests

brands = [989, 1344, 474, 1237, 886, 1, 328, 2188]
testid = collections.defaultdict(list)

for b in brands:
    headers = {}
    payload = {}
    url = f"https://webapi.depop.com/api/v2/search/products/?brands={b}&itemsPerPage=24&country=gb&currency=GBP&sort=relevance"
    response = requests.request("GET", url, headers=headers, data=payload)
    data = response.json()

    i = 0  # short circuit: stop after 3 cursors per brand
    while data.get("meta", {}).get("hasMore") and i < 3:
        cursor = data.get("meta", {}).get("cursor")
        testid[str(b)].append(cursor)
        # Request the next page using the cursor we just collected.
        response = requests.request("GET", f"{url}&cursor={cursor}", headers=headers, data=payload)
        data = response.json()
        i += 1

for key, value in testid.items():
    print(key, value)
This gives us:
989 ['MnwyNHwxNjQwMDMzMjM0']
1344 ['MnwyNHwxNjQwMDMzMjM1', 'M3w0OHwxNjQwMDMzMjM1', 'NHw3MnwxNjQwMDMzMjM1']
474 ['MnwyNHwxNjQwMDMzMjM3', 'M3w0OHwxNjQwMDMzMjM3', 'NHw3MnwxNjQwMDMzMjM3']
1237 ['MnwyNHwxNjQwMDMzMjM5', 'M3w0OHwxNjQwMDMzMjM5', 'NHw3MnwxNjQwMDMzMjM5']
886 ['MnwyNHwxNjQwMDMzMjQz', 'M3w0OHwxNjQwMDMzMjQz', 'NHw3MnwxNjQwMDMzMjQz']
1 ['MnwyNHwxNjQwMDMzMjQ4', 'M3w0OHwxNjQwMDMzMjQ4', 'NHw3MnwxNjQwMDMzMjQ4']
328 ['MnwyNHwxNjQwMDMzMjUz', 'M3w0OHwxNjQwMDMzMjUz', 'NHw3MnwxNjQwMDMzMjUz']
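If you want to try the pagination logic without hitting the network, the loop can be pulled into a small helper that takes the page-fetching function as an argument. This is only a sketch: collect_cursors, fake_fetch and the FAKE_PAGES data are names I made up, and the fake responses assume nothing beyond the meta.hasMore / meta.cursor fields used above.

```python
from typing import Callable, Dict, List, Optional


def collect_cursors(
    fetch_page: Callable[[Optional[str]], Dict],
    max_pages: int = 3,
) -> List[str]:
    """Follow meta.cursor while meta.hasMore is truthy, up to max_pages cursors."""
    cursors: List[str] = []
    data = fetch_page(None)  # the first page is requested without a cursor
    while data.get("meta", {}).get("hasMore") and len(cursors) < max_pages:
        cursor = data["meta"]["cursor"]
        cursors.append(cursor)
        data = fetch_page(cursor)
    return cursors


# Stand-in for the real API: three fake pages chained by made-up cursors.
FAKE_PAGES = {
    None: {"meta": {"hasMore": True, "cursor": "c1"}},
    "c1": {"meta": {"hasMore": True, "cursor": "c2"}},
    "c2": {"meta": {"hasMore": False}},
}


def fake_fetch(cursor: Optional[str]) -> Dict:
    return FAKE_PAGES[cursor]


print(collect_cursors(fake_fetch))
# ['c1', 'c2']
```

Against the real endpoint, fetch_page would presumably wrap the requests call from the code above and return response.json(); the cap of 3 can be raised or dropped in the same way as the short circuit.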
Wait a sec... What is going on with:
data.get("meta", {}).get("hasMore")
Great question, and I should have explained it before.
There is a chance that data has no "meta" key. If that were the case, the following would fail with a KeyError:
data["meta"].get("hasMore")
as would this, with an AttributeError, because data.get("meta") returns None:
data.get("meta").get("hasMore")
So what we did:
data.get("meta", {}).get("hasMore")
was use the second parameter of get() to provide a default value. In this case it is just an empty dict, but that is enough for us to safely chain the follow-up .get("hasMore") onto it.
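To see the three variants side by side, here is a small sketch using hand-written dicts in place of real responses. The last case is an extra caveat from me rather than something the API is known to do: the default only kicks in when the key is absent, so a response with "meta" explicitly set to null would still need the `or {}` form.

```python
missing = {}  # stand-in for a response with no "meta" key

try:
    missing["meta"].get("hasMore")
except KeyError as exc:
    print("KeyError:", exc)

try:
    missing.get("meta").get("hasMore")
except AttributeError as exc:
    print("AttributeError:", exc)

# With a default, the chain quietly yields None instead of raising.
safe_missing = missing.get("meta", {}).get("hasMore")
print(safe_missing)

# Caveat: the default is used only when the key is absent, not when it is None.
null_meta = {"meta": None}
safe_null = (null_meta.get("meta") or {}).get("hasMore")
print(safe_null)
```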