I'm doing analysis in Python. I have a JSON response from a POST request, and I want to extract several bits of information into a DataFrame to merge with another before doing some analysis.
The json is of the form:
json = {'type': 'abc',
        'results': [{'users': [{'id': '1', 'score': 0.9},
                               {'id': '2', 'score': 0.2}],
                     'num_users': 2,
                     'name': 'john smith'},
                    {'users': [{'id': '3', 'score': 0.7}],
                     'num_users': 1,
                     'name': 'david jones'},
                    {'users': [{'id': '5', 'score': 0.8},
                               {'id': '6', 'score': 0.3}],
                     'num_users': 2,
                     'name': 'jane smith'}]}
In the above example I want to extract each name and its associated ids and scores:
| name | id_1 | score_1 | id_2 | score_2 |
| --- | --- | --- | --- | --- |
| John Smith | 1 | 0.9 | 2 | 0.2 |
| David Jones | 3 | 0.7 | | |
| Jane Smith | 5 | 0.8 | 6 | 0.3 |
My initial plan was to loop through and create a list of dictionaries before creating the DataFrame, but when I did this I realised that the number of users (and therefore scores) varies per name, having initially assumed there would always be 2 ids and scores per name.
scores = []
for i in range(len(json['results'])):
    scores.append({'name': json['results'][i]['name'],
                   'id_1': json['results'][i]['users'][0]['id'],
                   'score_1': json['results'][i]['users'][0]['score'],
                   # fails with an IndexError when a result has only one user
                   'id_2': json['results'][i]['users'][1]['id'],
                   'score_2': json['results'][i]['users'][1]['score']})

pd.DataFrame(scores)
How do I create a dictionary with a varying number of keys, with different key names (_1, _2, etc.) depending on how many users there are? And will that work when the list of varying-length dictionaries is passed to pd.DataFrame?
Thanks.
CodePudding user response:
If data is your dictionary from the question:
df = pd.DataFrame(
    [
        {"name": d["name"]}
        | {
            f"{k}_{i}": v
            for i, d in enumerate(d["users"], 1)
            for k, v in d.items()
        }
        for d in data["results"]
    ]
).fillna("")
print(df.to_markdown())
Prints:
| | name | id_1 | score_1 | id_2 | score_2 |
| --- | --- | --- | --- | --- | --- |
| 0 | john smith | 1 | 0.9 | 2 | 0.2 |
| 1 | david jones | 3 | 0.7 | | |
| 2 | jane smith | 5 | 0.8 | 6 | 0.3 |
Or using ** instead of dict | dict:
df = pd.DataFrame(
    [
        {
            "name": d["name"],
            **{
                f"{k}_{i}": v
                for i, d in enumerate(d["users"], 1)
                for k, v in d.items()
            },
        }
        for d in data["results"]
    ]
).fillna("")
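If the comprehension feels dense, the same row-building logic can be written as a plain loop. A minimal sketch, assuming the same data variable as above; pd.DataFrame fills keys missing from a row (id_2/score_2 for single-user results) with NaN, so the varying lengths are handled automatically:

import pandas as pd

rows = []
for result in data["results"]:
    row = {"name": result["name"]}
    # add id_1/score_1, id_2/score_2, ... for however many users this result has
    for i, user in enumerate(result["users"], start=1):
        row[f"id_{i}"] = user["id"]
        row[f"score_{i}"] = user["score"]
    rows.append(row)

df = pd.DataFrame(rows)  # keys absent from a row show up as NaN in that row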
CodePudding user response:
You could use json_normalize and then pivot:
# get the data you need into a dataframe
df = pd.json_normalize(json["results"], record_path="users", meta="name")

# create a counter to pivot on
df["counter"] = df.groupby("name").cumcount().add(1)

# pivot to the desired structure
output = (df.pivot(index="name", columns="counter", values=["id", "score"])
            .sort_values(by="counter", axis=1)
            .rename_axis(None))

# collapse the multi-level header to a single level
output.columns = output.columns.map("{0[0]}_{0[1]}".format)
>>> output
id_1 score_1 id_2 score_2
david jones 3 0.7 NaN NaN
jane smith 5 0.8 6 0.3
john smith 1 0.9 2 0.2
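Since the question mentions merging this with another DataFrame before the analysis, note that the names end up as the (unnamed) index after the pivot. A small sketch, assuming a hypothetical other_df that has a name column to merge on:

# restore the index name, turn it back into a regular column, then merge
merged = (output.rename_axis("name")
                .reset_index()
                .merge(other_df, on="name", how="left"))  # other_df is hypothetical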
CodePudding user response:
You can quickly use:
>>> pd.json_normalize(json['results'], 'users', 'name')
id score name
0 1 0.9 john smith
1 2 0.2 john smith
2 3 0.7 david jones
3 5 0.8 jane smith
4 6 0.3 jane smith
Then pivot your dataframe:
out = (pd.json_normalize(json['results'], 'users', 'name')
         .assign(colid=lambda x: x.groupby('name')['id'].cumcount().add(1).astype(str))
         .pivot(index='name', columns='colid'))
out.columns = out.columns.to_flat_index().map('_'.join).rename(None)
print(out)
# Output
id_1 id_2 score_1 score_2
name
david jones 3 NaN 0.7 NaN
jane smith 5 6 0.8 0.3
john smith 1 2 0.9 0.2
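Since the dict in the question is the parsed body of a POST response, the same pipeline can run straight off the response object. A minimal sketch, assuming the requests library and a hypothetical endpoint/payload:

import requests
import pandas as pd

# hypothetical URL and payload, standing in for the real POST request
resp = requests.post("https://example.com/api/scores", json={"query": "smith"})
data = resp.json()  # the dict shown in the question

df = pd.json_normalize(data["results"], "users", "name")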