I have the following scraping script. I need to loop through many links which differ by the `T_ID`s included in the `data` dictionary. The script prints the result only for the first `T_ID`. Any idea how to improve this loop so it prints results for all `T_ID`s?
import pandas as pd
import requests
import json
import csv
import sys
from bs4 import BeautifulSoup

data = {'T_ID': [3396750, 3396753, 3396755, 3396757, 3396759]}
base_url = "XXXX"
username = "XXXX"
password = "XXXX"
toget = data
allowed_results = 50
max_results = "maxResults=" + str(allowed_results)
tc = "/tcyc?"
result_count = -1
start_index = 0
df = pd.DataFrame(columns=['id', 'name', 'gId', 'dKey', 'tPlan'])

for eachId in toget['T_ID']:
    while result_count != 0:
        start_at = "startAt=" + str(start_index)
        url = f'{base_url}{eachId}{tc}&{start_at}&{max_results}'
        response = requests.get(url, auth=(username, password))
        json_response = json.loads(response.text)
        print(json_response)
        page_info = json_response["meta"]["pageInfo"]
        start_index = page_info["startIndex"] + allowed_results
        result_count = page_info["resultCount"]
        items2 = json_response["data"]
        print(items2)
        for item in items2:
            new_item = {'id': item['id'], **item['fields']}
            df = df.append(new_item, ignore_index=True)
            print(item["id"])
            print(item["project"])
            print(item["fields"]["name"])
            print(item["fields"]["gId"])
            print(item["fields"]["dKey"])
            print(item["fields"]["tPlan"])
CodePudding user response:
It doesn't stop, it actually runs all the way through. The issue is that `start_index` is no longer 0 after it iterates through the first `eachId`. So when it gets to the next id, it's looking at something like:

`'XXXX.com/3396753/tcyc?&startAt=123&maxResults=50'`

And then likely returning a `result_count` of 0, which means the while loop doesn't run. Then it goes to the next id, and the same thing occurs.

Move your initial `result_count = -1` and `start_index = 0` within the `for` loop, before the `while`, as you'd want those to reset for each `T_ID`:
import pandas as pd
import requests
import json
import csv
import sys
from bs4 import BeautifulSoup

data = {'T_ID': [3396750, 3396753, 3396755, 3396757, 3396759]}
base_url = "XXXX"
username = "XXXX"
password = "XXXX"
toget = data
allowed_results = 50
max_results = "maxResults=" + str(allowed_results)
tc = "/tcyc?"
df = pd.DataFrame(columns=['id', 'name', 'gId', 'dKey', 'tPlan'])

for eachId in toget['T_ID']:
    # Reset the pagination state for every T_ID; otherwise the startAt
    # and result_count left over from the previous id carry through.
    start_index = 0
    result_count = -1
    while result_count != 0:
        start_at = "startAt=" + str(start_index)
        url = f'{base_url}{eachId}{tc}&{start_at}&{max_results}'
        response = requests.get(url, auth=(username, password))
        json_response = json.loads(response.text)
        print(json_response)
        page_info = json_response["meta"]["pageInfo"]
        start_index = page_info["startIndex"] + allowed_results
        result_count = page_info["resultCount"]
        items2 = json_response["data"]
        print(items2)
        for item in items2:
            new_item = {'id': item['id'], **item['fields']}
            df = df.append(new_item, ignore_index=True)
            print(item["id"])
            print(item["project"])
            print(item["fields"]["name"])
            print(item["fields"]["gId"])
            print(item["fields"]["dKey"])
            print(item["fields"]["tPlan"])