Home > Software engineering >  Create dictionaries of varying lengths in loop to append to a list. (json to Pandas DataFrame)
Create dictionaries of varying lengths in loop to append to a list. (json to Pandas DataFrame)

Time:04-29

I'm doing anaysis in python. I have a response from a post request in json, and I want to extract several bits of information into a DataFrame to merge with another before doing some analysis.

The json is of the form:

json =     {'type': 'abc',
            'results': [{'users': [{'id': '1',
                                    'score': 0.9},
                                   {'id': '2', 
                                   'score': 0.2}],
                         'num_users': 2,
                         'name': 'john smith'},
                        {'users': [{'id': '3',
                                    'score': 0.7}],
                         'num_users': 1,
                         'name': 'david jones'},
                        {'users': [{'id': '5',
                                    'score': 0.8},
                                   {'id': '6', 
                                    'score': 0.3}],
                         'num_users': 2,
                         'name': 'jane smith'}]}

In the above example I want to extract the name, and their associated the id's and scores:

name id_1 score_1 id_2 score_2
John Smith 1 0.9 2 0.2
David Jones 3 0.7
Jane Smith 5 0.8 6 0.3

My initial plan when was to loop through and create a list of dictionaries before creating the dataframe, but when I did this I realised that the number of users and therefore scores varied having initially thought it should have 2 ids and scores per name.

scores = []
for i in range(0, len(json):
    scores.append({'name':json[i]['name'],
                   'id_1': json[i]['users'][0]['id'],
                   'score_1': json[i]['users'][0]['score'],
                   'id_2': json[i]['users'][1]['id'],
                   'score_2': json[i]['users'][1]['score']})
pd.DataFrame(scores)

How do I create a dictionary with a varying length and of keys and different key names (_1, _2 etc.) depending upon the length? And will that work when the list of varying length dictionaries is passed to pd.DataFrame?

Thanks.

CodePudding user response:

If data is your dictionary from the question:

df = pd.DataFrame(
    [
        {"name": d["name"]}
        | {
            f"{k}_{i}": v
            for i, d in enumerate(d["users"], 1)
            for k, v in d.items()
        }
        for d in data["results"]
    ]
).fillna("")

print(df.to_markdown())

Prints:

name id_1 score_1 id_2 score_2
0 john smith 1 0.9 2 0.2
1 david jones 3 0.7
2 jane smith 5 0.8 6 0.3

Or using ** instead of dict | dict:

df = pd.DataFrame(
    [
        {
            "name": d["name"],
            **{
                f"{k}_{i}": v
                for i, d in enumerate(d["users"], 1)
                for k, v in d.items()
            },
        }
        for d in data["results"]
    ]
).fillna("")

CodePudding user response:

You could use json_normalize and then pivot:

#get the data you need into a dataframe
df = pd.json_normalize(json["results"],record_path="users",meta="name")

#create a counter to pivot
df["counter"] = df.groupby("name").cumcount().add(1)

#pivot to the desired structure
output = df.pivot("name","counter",["id","score"]).sort_values(by="counter",axis=1).rename_axis(None)

#collapse multi-level header to single level
output.columns = output.columns.map("{0[0]}_{0[1]}".format)

>>> output
            id_1 score_1 id_2 score_2
david jones    3     0.7  NaN     NaN
jane smith     5     0.8    6     0.3
john smith     1     0.9    2     0.2

CodePudding user response:

You can quickly use:

>>> pd.json_normalize(json['results'], 'users', 'name')
  id  score         name
0  1    0.9   john smith
1  2    0.2   john smith
2  3    0.7  david jones
3  5    0.8   jane smith
4  6    0.3   jane smith

Then pivot your dataframe:

out = (pd.json_normalize(json['results'], 'users', 'name')
         .assign(colid=lambda x: x.groupby('name')['id'].cumcount().add(1).astype(str))
         .pivot('name', 'colid'))
out.columns = out.columns.to_flat_index().map('_'.join).rename(None)
print(out)

# Output
            id_1 id_2  score_1  score_2
name                                   
david jones    3  NaN      0.7      NaN
jane smith     5    6      0.8      0.3
john smith     1    2      0.9      0.2
  • Related