Home > Software engineering >  Pandas data structure help - dealing with nested JSON
Pandas data structure help - dealing with nested JSON

Time:05-10

I'm trying to normalize data extracted from my database. Among this data there is column called ExtraData which is represented in nested jsons. My ExtraData JSON can be one of these cases :

{"Data":{"Parties":[{"ID":"JackyID","Role":12}],"NbIDs":1}} #One party identified
{"Data":{"Parties":[{"ID":"JackyID","Role":12},{"ID":"SamNumber","Role":10}],"NbIDs":2}} #Two Parties identified
{"Data":{"Parties":[],"NbIDs":0}} #No parties identified
{"Data": None} #No data

When looking to extract the values of ID (ID of the party - String datatype) and Role (Int datatype - refer to buyers when Role=12 and sellers when Role=10) and write "" when nothing is present , I'm using the following method :

def parse_extra_data(data_str):
    response = {'Buyer': '', 'Seller': ''}
    try:
        data_json = json.loads(data_str)
    except:
        return response
    for party in data_json['Data']['Parties']:
        if party['Role'] == 12:
            response['Buyer'] = party['ID']
        elif party['PartyRole'] == 122:
            response['Seller'] = party['ID']        
    return response

Now, when I want to apply this method to my code :

import json
import pandas.io.json as pd_json

query="SELECT OrderID, ExtraData from tbl_data;"
test_data= crsr.execute(query)
columns_test = [column[0] for column in crsr.description]
rows = [list(x) for x in test_data]
df = pd.DataFrame(rows, columns=columns_test)
for i,row in df.iterrows():
    test = json.dumps(row['Data'])
    data = pd_json.loads(test)
    data_json = json.loads(data)
    df['Buyer'] = df.apply(lambda row: parse_extra_data(data_json['Data'])['Buyer'], axis=1)
    df['Seller'] = df.apply(lambda row: parse_extra_data(data_json['Data'])['Seller'], axis=1)
df.rename(columns={
    'OrderID' :'ID of the order'
 }, inplace = True) 
df = df[['ID of the order','Buyer', 'Seller']]

When executing this code, df is the following :

>>print(df)
ID of the order   |Buyer     | Seller
--------------------------------------
321               |          |              
456               |          |    
789               |          |    
987               |          |            

And print(data_json) show only the first JSON from ExtraData.

What am I doing wrong ? and how to fix that ? If we take the scenario above as the database input, df should look like :

>>print(df)
ID of the order   |Buyer  | Seller
---------------------------------------
321               |JackyID|              #Transaction 1 we have info about the buyer
456               |JackyID| SamNumber    #Transaction 2 we have infos about the buyer and the seller
789               |       |              #Transaction 3 we don't have any infos about the parties
987               |       |              #Transaction 4 we don't have any infos about the parties

CodePudding user response:

Well, there isn't data from database as example to try to replicate the error, despite that I think the error is in the df.apply lines (take a look how to use it), because you're using a lambda function with row as parameter, but it's never called inside lambda definition what it means none of columns is gonna be affected by the parse_extra_data function.

Example how to use apply:

df.apply(lambda x: func(x['col1'],x['col2']),axis=1)

CodePudding user response:

Given:

data = [{'Data': {'Parties': [{'ID': 'JackyID', 'Role': 12}], 'NbIDs': 1}}, {'Data': {'Parties': [{'ID': 'JackyID', 'Role': 12}, {'ID': 'SamNumber', 'Role': 10}], 'NbIDs': 2}}, {'Data': {'Parties': [], 'NbIDs': 0}}, {'Data': None}]

Doing:

df = pd.json_normalize([x['Data'] for x in data if x['Data']], 'Parties', 'NbIDs')
df.Role.replace({12:'Buyer', 10:'Seller'}, inplace=True)
df = df.pivot(columns='Role', index='NbIDs', values='ID')
print(df)

Output:

Role     Buyer     Seller
NbIDs
1      JackyID        NaN
2      JackyID  SamNumber
  • Related