I have the following scraping script and need to get the elements inside the "items2" for loop. The script prints all the elements, but the resulting DataFrame shows "name" and "tPlan" as NaN. Any idea why?
import requests
import json
import csv
import sys
from bs4 import BeautifulSoup
import pandas as pd
base_url = "xxxx"
username = "xxxx"
password = "xxxx"
toget = data
allowed_results = 50
max_results = "maxResults=" + str(allowed_results)
tc = "/testcycles?"
result_count = -1
start_index = 0
df = pd.DataFrame(
columns=['id', 'name', 'gId', 'dKey', 'tPlan'])
for eachId in toget['TPlan_ID']:
    while result_count != 0:
        start_at = "startAt=" + str(start_index)
        url = f'{base_url}{eachId}{tc}&{start_at}&{max_results}'
        response = requests.get(url, auth=(username, password))
        json_response = json.loads(response.text)
        print(json_response)
        page_info = json_response["meta"]["pageInfo"]
        start_index = page_info["startIndex"] + allowed_results
        result_count = page_info["resultCount"]
        items2 = json_response["data"]
        print(items2)
        for item in items2:
            print(item["id"])
            print(item["fields"]["name"])
            print(item["fields"]["gId"])
            print(item["fields"]["dKey"])
            print(item["fields"]["tPlan"])
            temporary_df = pd.DataFrame([item], columns=['id', 'name', 'gId', 'dKey', 'tPlan'])
            df = df.append(temporary_df, ignore_index=True)
CodePudding user response:
TLDR
Use this for loop:
for item in items2:
    df = df.append({'id': item['id'], **item['fields']}, ignore_index=True)
Explanation
I am assuming that items2 looks something like this:
items2 = [
{ 'id': 0, 'fields': {'name': 'prop1', 'gId': 100, 'dKey': 'key1', 'tPlan': 'plan1'}},
{ 'id': 1, 'fields': {'name': 'prop2', 'gId': 200, 'dKey': 'key2', 'tPlan': 'plan2'}},
{ 'id': 2, 'fields': {'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}},
]
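You can't create your intended DataFrame directly, because each item is structured like this:
{'id': 2, 'fields': {'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}}
The name, gId, dKey and tPlan values are nested inside fields rather than being top-level keys, so pandas finds nothing for those columns and temporary_df ends up filled with NaN: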
    id  name  gId  dKey  tPlan  fields
0    0   NaN  NaN   NaN    NaN    key1
1    0   NaN  NaN   NaN    NaN     100
2    0   NaN  NaN   NaN    NaN   prop1
3    0   NaN  NaN   NaN    NaN   plan1
4    1   NaN  NaN   NaN    NaN    key2
5    1   NaN  NaN   NaN    NaN     200
6    1   NaN  NaN   NaN    NaN   prop2
7    1   NaN  NaN   NaN    NaN   plan2
8    2   NaN  NaN   NaN    NaN    key3
9    2   NaN  NaN   NaN    NaN     300
10   2   NaN  NaN   NaN    NaN   prop3
11   2   NaN  NaN   NaN    NaN   plan3
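The cause can be reproduced with a single sample item. This is only a small sketch, assuming the item shape shown above and the `columns` list from the question. The exact shape of the output depends on how the frame is built (passing `[item]` with a `columns` list, as the question does, gives one row per item), but the reason for the NaN values is the same: pandas only matches top-level keys against the column names.

```python
import pandas as pd

# Sample item, assumed to match the shape of the API response
item = {'id': 2, 'fields': {'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}}
columns = ['id', 'name', 'gId', 'dKey', 'tPlan']

# Same constructor call as in the question: only 'id' is a top-level key,
# so the other four columns have no matching data
nested_df = pd.DataFrame([item], columns=columns)
print(nested_df)

# Flattening the item first gives every column a matching top-level key
flat_df = pd.DataFrame([{'id': item['id'], **item['fields']}], columns=columns)
print(flat_df)
```

The first frame has `id` filled in and NaN everywhere else; the second has a value in every column.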
What you need to pass to pd.DataFrame is a flat dict like
{'id': 2, 'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}
Notice that the fields dict is gone: all of its key-value pairs have been added to item directly. Using this altered dict results in a temporary_df like
   id   name  gId  dKey  tPlan
0   0  prop1  100  key1  plan1
1   1  prop2  200  key2  plan2
2   2  prop3  300  key3  plan3
To change the structure of item this way, you can do this:
new_item = {'id': item['id']}
for key, value in item['fields'].items():
    new_item[key] = value
But you can write this more concisely by using the dict unpacking operator **:
new_item = {'id': item['id'], **item['fields']}
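As a quick check that the two forms are equivalent, here is a small sketch using one sample item (the item itself is assumed, following the shape used above):

```python
# Sample item, assumed to match the shape of the API response
item = {'id': 2, 'fields': {'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}}

# Explicit loop: copy every key-value pair from 'fields' to the top level
looped = {'id': item['id']}
for key, value in item['fields'].items():
    looped[key] = value

# Dict unpacking: the same result in a single expression
unpacked = {'id': item['id'], **item['fields']}

print(looped == unpacked)
print(unpacked)
```

One thing to keep in mind: if fields ever contained its own 'id' key, the unpacked value would overwrite the outer one in both versions, since later keys win.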
Now we can pass new_item as an argument to pd.DataFrame:
temp_df = pd.DataFrame({ 'id': item['id'], **item['fields']}, index=(i,)) # i here is the row index of the DataFrame
After making these changes, your for loop should look something like this:
for i, item in enumerate(items2):
    new_item = {'id': item['id'], **item['fields']}
    temp_df = pd.DataFrame(new_item, index=(i,))
    df = df.append(temp_df, ignore_index=True)
We can make this a bit more concise by passing new_item directly to DataFrame.append (available in pandas versions before 2.0).
So in the end, this code should work:
for item in items2:
    new_item = {'id': item['id'], **item['fields']}
    df = df.append(new_item, ignore_index=True)
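Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above raises an AttributeError on current versions. A possible alternative, sketched here under the same assumption about the shape of items2, is to collect the flattened dicts in a plain list and build the DataFrame once at the end. The `pages` list is hypothetical and stands in for the successive `json_response["data"]` values from the paginated requests.

```python
import pandas as pd

columns = ['id', 'name', 'gId', 'dKey', 'tPlan']

# Hypothetical stand-in for the "data" lists returned by successive pages
pages = [
    [
        {'id': 0, 'fields': {'name': 'prop1', 'gId': 100, 'dKey': 'key1', 'tPlan': 'plan1'}},
        {'id': 1, 'fields': {'name': 'prop2', 'gId': 200, 'dKey': 'key2', 'tPlan': 'plan2'}},
    ],
    [
        {'id': 2, 'fields': {'name': 'prop3', 'gId': 300, 'dKey': 'key3', 'tPlan': 'plan3'}},
    ],
]

all_rows = []
for items2 in pages:
    # Flatten each item and keep it as a plain dict for now
    all_rows.extend({'id': item['id'], **item['fields']} for item in items2)

# Build the DataFrame once, after all pages have been collected
df = pd.DataFrame(all_rows, columns=columns)
print(df)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what appending row by row does.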