I'm trying to normalize data extracted from my database. Among this data there is column called ExtraData
which is represented in nested jsons. My ExtraData
JSON can be one of these cases :
{"Data":{"Parties":[{"ID":"JackyID","Role":12}],"NbIDs":1}} #One party identified
{"Data":{"Parties":[{"ID":"JackyID","Role":12},{"ID":"SamNumber","Role":10}],"NbIDs":2}} #Two Parties identified
{"Data":{"Parties":[],"NbIDs":0}} #No parties identified
{"Data": None} #No data
When looking to extract the values of ID (ID of the party - String datatype) and Role (Int datatype - refer to buyers when Role=12 and sellers when Role=10) and write "" when nothing is present , I'm using the following method :
def parse_extra_data(data_str):
response = {'Buyer': '', 'Seller': ''}
try:
data_json = json.loads(data_str)
except:
return response
for party in data_json['Data']['Parties']:
if party['Role'] == 12:
response['Buyer'] = party['ID']
elif party['PartyRole'] == 122:
response['Seller'] = party['ID']
return response
Now, when I want to apply this method to my code :
import json
import pandas.io.json as pd_json
query="SELECT OrderID, ExtraData from tbl_data;"
test_data= crsr.execute(query)
columns_test = [column[0] for column in crsr.description]
rows = [list(x) for x in test_data]
df = pd.DataFrame(rows, columns=columns_test)
for i,row in df.iterrows():
test = json.dumps(row['Data'])
data = pd_json.loads(test)
data_json = json.loads(data)
df['Buyer'] = df.apply(lambda row: parse_extra_data(data_json['Data'])['Buyer'], axis=1)
df['Seller'] = df.apply(lambda row: parse_extra_data(data_json['Data'])['Seller'], axis=1)
df.rename(columns={
'OrderID' :'ID of the order'
}, inplace = True)
df = df[['ID of the order','Buyer', 'Seller']]
When executing this code, df is the following :
>>print(df)
ID of the order |Buyer | Seller
--------------------------------------
321 | |
456 | |
789 | |
987 | |
And print(data_json)
show only the first JSON from ExtraData
.
What am I doing wrong ? and how to fix that ? If we take the scenario above as the database input, df should look like :
>>print(df)
ID of the order |Buyer | Seller
---------------------------------------
321 |JackyID| #Transaction 1 we have info about the buyer
456 |JackyID| SamNumber #Transaction 2 we have infos about the buyer and the seller
789 | | #Transaction 3 we don't have any infos about the parties
987 | | #Transaction 4 we don't have any infos about the parties
CodePudding user response:
Well, there isn't data from database as example to try to replicate the error, despite that I think the error is in the df.apply
lines (take a look how to use it), because you're using a lambda
function with row as parameter, but it's never called inside lambda
definition what it means none of columns is gonna be affected by the parse_extra_data
function.
Example how to use apply
:
df.apply(lambda x: func(x['col1'],x['col2']),axis=1)
CodePudding user response:
Given:
data = [{'Data': {'Parties': [{'ID': 'JackyID', 'Role': 12}], 'NbIDs': 1}}, {'Data': {'Parties': [{'ID': 'JackyID', 'Role': 12}, {'ID': 'SamNumber', 'Role': 10}], 'NbIDs': 2}}, {'Data': {'Parties': [], 'NbIDs': 0}}, {'Data': None}]
Doing:
df = pd.json_normalize([x['Data'] for x in data if x['Data']], 'Parties', 'NbIDs')
df.Role.replace({12:'Buyer', 10:'Seller'}, inplace=True)
df = df.pivot(columns='Role', index='NbIDs', values='ID')
print(df)
Output:
Role Buyer Seller
NbIDs
1 JackyID NaN
2 JackyID SamNumber