I'm doing analysis in Python. I have a JSON response from a POST request, and I want to extract several bits of information into a DataFrame to merge with another before doing some analysis.
The json is of the form:
json = {'type': 'abc',
        'results': [{'users': [{'id': '1', 'score': 0.9},
                               {'id': '2', 'score': 0.2}],
                     'num_users': 2,
                     'name': 'john smith'},
                    {'users': [{'id': '3', 'score': 0.7}],
                     'num_users': 1,
                     'name': 'david jones'},
                    {'users': [{'id': '5', 'score': 0.8},
                               {'id': '6', 'score': 0.3}],
                     'num_users': 2,
                     'name': 'jane smith'}]}
In the above example I want to extract each name and its associated ids and scores:
| name | id_1 | score_1 | id_2 | score_2 |
| --- | --- | --- | --- | --- |
| John Smith | 1 | 0.9 | 2 | 0.2 |
| David Jones | 3 | 0.7 | | |
| Jane Smith | 5 | 0.8 | 6 | 0.3 |
My initial plan was to loop through and create a list of dictionaries before creating the DataFrame, but when I did this I realised that the number of users (and therefore scores) varies per name, having initially assumed there would always be 2 ids and scores per name.
scores = []
for i in range(len(json['results'])):
    scores.append({'name': json['results'][i]['name'],
                   'id_1': json['results'][i]['users'][0]['id'],
                   'score_1': json['results'][i]['users'][0]['score'],
                   # fails with an IndexError when a result has only one user
                   'id_2': json['results'][i]['users'][1]['id'],
                   'score_2': json['results'][i]['users'][1]['score']})

pd.DataFrame(scores)
How do I create a dictionary with a varying number of keys, with different key names (_1, _2, etc.) depending on how many users there are? And will that work when the list of varying-length dictionaries is passed to pd.DataFrame?
Thanks.
CodePudding user response:
If data is your dictionary from the question:
df = pd.DataFrame(
    [
        {"name": d["name"]}
        | {
            f"{k}_{i}": v
            for i, d in enumerate(d["users"], 1)
            for k, v in d.items()
        }
        for d in data["results"]
    ]
).fillna("")
print(df.to_markdown())
Prints:
| | name | id_1 | score_1 | id_2 | score_2 |
| --- | --- | --- | --- | --- | --- |
| 0 | john smith | 1 | 0.9 | 2 | 0.2 |
| 1 | david jones | 3 | 0.7 | | |
| 2 | jane smith | 5 | 0.8 | 6 | 0.3 |
Or using ** instead of dict | dict:
df = pd.DataFrame(
    [
        {
            "name": d["name"],
            **{
                f"{k}_{i}": v
                for i, d in enumerate(d["users"], 1)
                for k, v in d.items()
            },
        }
        for d in data["results"]
    ]
).fillna("")
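If the comprehension feels dense, the same row-building logic can be written as a plain loop. A minimal sketch, assuming the same data variable as above; pd.DataFrame fills keys missing from a row (id_2/score_2 for single-user results) with NaN, so the varying lengths are handled automatically:

import pandas as pd

rows = []
for result in data["results"]:
    row = {"name": result["name"]}
    # add id_1/score_1, id_2/score_2, ... for however many users this result has
    for i, user in enumerate(result["users"], start=1):
        row[f"id_{i}"] = user["id"]
        row[f"score_{i}"] = user["score"]
    rows.append(row)

df = pd.DataFrame(rows)  # keys absent from a row show up as NaN in that row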
CodePudding user response:
You could use json_normalize and then pivot:
# get the data you need into a dataframe
df = pd.json_normalize(json["results"], record_path="users", meta="name")

# create a counter to pivot on
df["counter"] = df.groupby("name").cumcount().add(1)

# pivot to the desired structure
output = (df.pivot(index="name", columns="counter", values=["id", "score"])
            .sort_values(by="counter", axis=1)
            .rename_axis(None))

# collapse the multi-level header to a single level
output.columns = output.columns.map("{0[0]}_{0[1]}".format)
>>> output
id_1 score_1 id_2 score_2
david jones 3 0.7 NaN NaN
jane smith 5 0.8 6 0.3
john smith 1 0.9 2 0.2
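Since the question mentions merging this with another DataFrame before the analysis, note that the names end up as the (unnamed) index after the pivot. A small sketch, assuming a hypothetical other_df that has a name column to merge on:

# restore the index name, turn it back into a regular column, then merge
merged = (output.rename_axis("name")
                .reset_index()
                .merge(other_df, on="name", how="left"))  # other_df is hypothetical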
CodePudding user response:
You can quickly use:
>>> pd.json_normalize(json['results'], 'users', 'name')
id score name
0 1 0.9 john smith
1 2 0.2 john smith
2 3 0.7 david jones
3 5 0.8 jane smith
4 6 0.3 jane smith
Then pivot your dataframe:
out = (pd.json_normalize(json['results'], 'users', 'name')
         .assign(colid=lambda x: x.groupby('name')['id'].cumcount().add(1).astype(str))
         .pivot(index='name', columns='colid'))
out.columns = out.columns.to_flat_index().map('_'.join).rename(None)
print(out)
# Output
id_1 id_2 score_1 score_2
name
david jones 3 NaN 0.7 NaN
jane smith 5 6 0.8 0.3
john smith 1 2 0.9 0.2
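Since the dict in the question is the parsed body of a POST response, the same pipeline can run straight off the response object. A minimal sketch, assuming the requests library and a hypothetical endpoint/payload:

import requests
import pandas as pd

# hypothetical URL and payload, standing in for the real POST request
resp = requests.post("https://example.com/api/scores", json={"query": "smith"})
data = resp.json()  # the dict shown in the question

df = pd.json_normalize(data["results"], "users", "name")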