Home > Software engineering >  How to get metadata when reading nested json with pandas
How to get metadata when reading nested json with pandas

Time:02-02

I'm trying to get the metadata out from a json using pandas json_normalize, but it does not work as expected.

I have a json fine with the following structure

data=[
    {'a':'aa',
    'b':{'b1':'bb1','b2':'bb2'},
    'c':[{
        'ca':[{'ca1':'caa1'
            }]
        }]
    }]  

I'd like to get the following

ca1 a b.b1
caa1 aa bb1

I would expect this to work

pd.json_normalize(data, record_path=['c','ca'], meta = ['a',['b','b1']])

but it doesn't find the key b1. Strangely enough if my record_path is 'c' alone it does find the key. I feel I'm missing something here, but I can't figure out what. I appreciate any help!

CodePudding user response:

Going down first level you grab the meta as a list of columns you want to keep. Record path use a list to map levels that you want to go down. Finally column b is a dict you can apply to a Series concat back into df and pop to remove unpacked dict column.

df = pd.json_normalize(
    data=data,
    meta=['a', 'b'],
    record_path=['c', 'ca']
)
df = pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
print(df)

Output:

     ca1   a   b1   b2
0  caa1  aa  bb1  bb2

CodePudding user response:

This is a workaround I used eventually

data=[
    {'a':'aa',
    'b':{'b1':'bb1','b2':'bb2'},
    'c':[{
        'ca':[{'ca1':'caa1'
              }]
    }]
    }]
  

df = pd.json_normalize(data, record_path=['c','ca'], meta = ['a',['b']]
                 )

df = pd.concat([df,pd.json_normalize(df['b'])],axis = 1)
df.drop(columns='b',inplace = True)

I still think there should be a better way, but it works

  • Related