mapping some key-value pairs from nested json to new columns in Pandas dataframe-CodePudding

I spent a few hours searching for hints on how to do this, and tried a bunch of things (see below). I'm not getting anywhere, so I finally decided to post a new question.

I have a nested JSON with a dictionary data structure, like this:

for k,v in d.items():
    print(f'{k} = {v}')

First two keys:

obj1 = {'color': 'red', 'size': 'large', 'description': 'a large red ball'}
obj2 = {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}

Side question: is this actually a nested json? Each key (obj1, obj2) has a set of keys, so I think so but I'm not sure.

I then have a dataframe like this:

df
key   id_num   source
obj1  143  loc1
obj2  139  loc1

I want to map only 'size' and 'description' from my json dictionary to this dataframe, by key. And I want to do that efficiently and readably. I also want it to be robust to the presence of the key, so that if a key doesn't exist in the JSON dict, it just prints "NA" or something.

Things I've tried that got me closest (I tried to map one column at a time, and both at same time):

df['description'] = df['key'].map(d['description'])

df['description'] = df['key'].map(lambda x: d[x]['description'])

df2 = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['size','description']), on='key')

The first one - it's obvious why this doesn't work. It prints KeyError: 'description', as expected. The second one I think would work, but there is a key in my dataframe that doesn't exist in my JSON dict. It prints KeyError: 'obj42' (an object in my df but not in d). The third one works, but requires creating a new dataframe which I'd like to avoid.

How can I make Solution #2 robust to missing keys? Also, is there a way to assign both columns at the same time without creating a new df? I found a way to assign all values in the dict here, but that's not what I want. I only want a couple.

There's always a possibility that my search keywords were not quite right, so if a post exists that answers my question please do let me know and I can delete this one.

CodePudding user response：

One way to go, based on your second attempt, would be as follows:

import pandas as pd
import numpy as np

d = {'obj1': {'color': 'red', 'size': 'large', 'description': 'a large red ball'},
     'obj2': {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}
     }

# just adding `obj3` here to supply a `KeyError`
data = {'key': {0: 'obj1', 1: 'obj2', 2: 'obj3'},
 'id_num': {0: 143, 1: 139, 2: 140},
 'source': {0: 'loc1', 1: 'loc1', 2: 'loc1'}}

df = pd.DataFrame(data)

df[['size','description']] = df['key'].map(lambda x: [d[x]['size'], d[x]['description']] if x in d else [np.nan]*2).tolist()

print(df)

    key  id_num source   size        description
0  obj1     143   loc1  large   a large red ball
1  obj2     139   loc1  small  a small blue ball
2  obj3     140   loc1    NaN                NaN

CodePudding user response：

You can create a dataframe from the dictionary and then do .merge:

df = df.merge(
    pd.DataFrame(d.values(), index=d.keys())[["size", "description"]],
    left_on="key",
    right_index=True,
    how="left",
)
print(df)

Prints:

    key  id_num source   size        description
0  obj1     143   loc1  large   a large red ball
1  obj2     139   loc1  small  a small blue ball
2  obj3     140   loc1    NaN                NaN

Data used:

d = {
    "obj1": {
        "color": "red",
        "size": "large",
        "description": "a large red ball",
    },
    "obj2": {
        "color": "blue",
        "size": "small",
        "description": "a small blue ball",
    },
}

data = {
    "key": {0: "obj1", 1: "obj2", 2: "obj3"},
    "id_num": {0: 143, 1: 139, 2: 140},
    "source": {0: "loc1", 1: "loc1", 2: "loc1"},
}
df = pd.DataFrame(data)