I spent a few hours searching for hints on how to do this, and tried a bunch of things (see below). I'm not getting anywhere, so I finally decided to post a new question.
I have a nested JSON with a dictionary data structure, like this:
for k,v in d.items():
print(f'{k} = {v}')
First two keys:
obj1 = {'color': 'red', 'size': 'large', 'description': 'a large red ball'}
obj2 = {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}
Side question: is this actually a nested json? Each key (obj1, obj2) has a set of keys, so I think so but I'm not sure.
I then have a dataframe like this:
df
key id_num source
obj1 143 loc1
obj2 139 loc1
I want to map only 'size' and 'description' from my json dictionary to this dataframe, by key. And I want to do that efficiently and readably. I also want it to be robust to the presence of the key, so that if a key doesn't exist in the JSON dict, it just prints "NA" or something.
Things I've tried that got me closest (I tried to map one column at a time, and both at same time):
df['description'] = df['key'].map(d['description'])
df['description'] = df['key'].map(lambda x: d[x]['description'])
df2 = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['size','description']), on='key')
The first one - it's obvious why this doesn't work. It prints KeyError: 'description', as expected. The second one I think would work, but there is a key in my dataframe that doesn't exist in my JSON dict. It prints KeyError: 'obj42' (an object in my df but not in d). The third one works, but requires creating a new dataframe which I'd like to avoid.
How can I make Solution #2 robust to missing keys? Also, is there a way to assign both columns at the same time without creating a new df? I found a way to assign all values in the dict here, but that's not what I want. I only want a couple.
There's always a possibility that my search keywords were not quite right, so if a post exists that answers my question please do let me know and I can delete this one.
CodePudding user response:
One way to go, based on your second attempt, would be as follows:
import pandas as pd
import numpy as np
d = {'obj1': {'color': 'red', 'size': 'large', 'description': 'a large red ball'},
'obj2': {'color': 'blue', 'size': 'small', 'description': 'a small blue ball'}
}
# just adding `obj3` here to supply a `KeyError`
data = {'key': {0: 'obj1', 1: 'obj2', 2: 'obj3'},
'id_num': {0: 143, 1: 139, 2: 140},
'source': {0: 'loc1', 1: 'loc1', 2: 'loc1'}}
df = pd.DataFrame(data)
df[['size','description']] = df['key'].map(lambda x: [d[x]['size'], d[x]['description']] if x in d else [np.nan]*2).tolist()
print(df)
key id_num source size description
0 obj1 143 loc1 large a large red ball
1 obj2 139 loc1 small a small blue ball
2 obj3 140 loc1 NaN NaN
CodePudding user response:
You can create a dataframe from the dictionary and then do .merge
:
df = df.merge(
pd.DataFrame(d.values(), index=d.keys())[["size", "description"]],
left_on="key",
right_index=True,
how="left",
)
print(df)
Prints:
key id_num source size description
0 obj1 143 loc1 large a large red ball
1 obj2 139 loc1 small a small blue ball
2 obj3 140 loc1 NaN NaN
Data used:
d = {
"obj1": {
"color": "red",
"size": "large",
"description": "a large red ball",
},
"obj2": {
"color": "blue",
"size": "small",
"description": "a small blue ball",
},
}
data = {
"key": {0: "obj1", 1: "obj2", 2: "obj3"},
"id_num": {0: 143, 1: 139, 2: 140},
"source": {0: "loc1", 1: "loc1", 2: "loc1"},
}
df = pd.DataFrame(data)