I have an original DataFrame that contain a list of JSON in each record
Here is an example of the DataFrame
import pandas as pd
d = [{
'id': 'abcd1234',
'property_name':'a',
'timestamp':'2021:12:10',
'raw_data': [
{
'key1':'value1',
'key2':'value2'
},
{
'key3':'value3',
'key4':'value4'
}]
},
{ 'id':'efgh5678',
'property_name':'b',
'timestamp':'2021:12:12',
'raw_data': [{
'key5':'value5',
'key6':'value6'
}]
}]
df = pd.DataFrame(d)
Here is the output of DataFrame
id property_name timestamp raw_data
0 abcd1234 a 2021:12:10 [{'key1': 'value1', 'key2': 'value2'}, {'key3': 'value3', 'key4': 'value4'}]
1 efgh5678 b 2021:12:12 [{'key5': 'value5', 'key6': 'value6'}]
Now, I want to transform this DataFrame with the following condition,
- Pivot the keys of JSON as a column and value of the JSON as a value
- Have some prefix of the columns name based on 'property_name' field
- For a record with absence value, make it NULL
Here is a desired final DataFrame
id property_name timestamp a_key1 a_key2 a_key3 a_key4 b_key5 b_key6
0 abcd1234 a 2021:12:10 value1 value2 value3 value4 NaN NaN
1 efgh5678 b 2021:12:12 NaN NaN NaN NaN value5 value6
So far, I have succeeded it with for loop in each record which is not quite efficient way to achieve since the DataFrame is expected to grow exponentially. Anyway to make this function in the efficient way?
Ps1. The following is the coded that I am using
raw_data = pd.DataFrame()
for idx,val in df["raw_data"].iteritems():
the_list = json.loads(val)
temp = pd.DataFrame(the_list, index=[idx]*len(the_list))
temp['key'] = stg1.loc[idx, 'raw_data'] '_' temp['key']
raw_data = pd.concat([raw_data,temp], axis=0)
raw_data = raw_data.set_index("property_name",append=True).unstack()
raw_data.columns = raw_data.columns.droplevel(0)
final = df[['uid', 'property_name', 'timestamp']].join(raw_data)
Ps2. Apologize for the unclear DataFrame format. This is the first time I post on Stack overflow and I don't figure out how to post pretty format of DataFrame.
CodePudding user response:
You could first modify the dicts in "raw_data" to include the property names in the keys. Then set_index
with the column names other than "raw_data", then explode
the "raw_data" column. Then convert the resulting Series to a DataFrame, groupby
first
to get a single row for each property name.
df['raw_data'] = [[{f'{pn}_{k}': v for k,v in d.items()} for d in lst] for pn, lst in zip(df['property_name'], df['raw_data'])]
s = df.set_index(df.columns.drop('raw_data').tolist())['raw_data'].explode()
out = pd.DataFrame(s.tolist(), index=s.index).groupby(level=[0,1,2]).first().reset_index().fillna(np.nan)
Output:
id property_name timestamp a_key1 a_key2 a_key3 a_key4 b_key5 b_key6
0 abcd1234 a 2021:12:10 value1 value2 value3 value4 NaN NaN
1 efgh5678 b 2021:12:12 NaN NaN NaN NaN value5 value6