Home > Software engineering >  How to pivot value of JSON in a DataFrame
How to pivot value of JSON in a DataFrame

Time:02-15

I have an original DataFrame that contain a list of JSON in each record

Here is an example of the DataFrame

import pandas as pd
d = [{
    'id': 'abcd1234',
    'property_name':'a',
    'timestamp':'2021:12:10',
    'raw_data': [
        {
        'key1':'value1',
        'key2':'value2'
    },
    {
        'key3':'value3',
        'key4':'value4'
    }]
    
},
{   'id':'efgh5678',
    'property_name':'b',
    'timestamp':'2021:12:12',
    'raw_data': [{
        'key5':'value5',
        'key6':'value6'
    }]
}]
df = pd.DataFrame(d)

Here is the output of DataFrame

id  property_name   timestamp   raw_data
0   abcd1234    a   2021:12:10  [{'key1': 'value1', 'key2': 'value2'}, {'key3': 'value3', 'key4': 'value4'}]
1   efgh5678    b   2021:12:12  [{'key5': 'value5', 'key6': 'value6'}]

Now, I want to transform this DataFrame with the following condition,

  1. Pivot the keys of JSON as a column and value of the JSON as a value
  2. Have some prefix of the columns name based on 'property_name' field
  3. For a record with absence value, make it NULL

Here is a desired final DataFrame

id  property_name   timestamp   a_key1  a_key2  a_key3  a_key4  b_key5  b_key6
0   abcd1234    a   2021:12:10  value1  value2  value3  value4  NaN     NaN
1   efgh5678    b   2021:12:12  NaN     NaN     NaN      NaN    value5  value6

So far, I have succeeded it with for loop in each record which is not quite efficient way to achieve since the DataFrame is expected to grow exponentially. Anyway to make this function in the efficient way?

Ps1. The following is the coded that I am using

raw_data = pd.DataFrame()
for idx,val in df["raw_data"].iteritems():
    the_list = json.loads(val)
    temp = pd.DataFrame(the_list, index=[idx]*len(the_list))
    temp['key'] = stg1.loc[idx, 'raw_data']   '_'   temp['key']
    raw_data = pd.concat([raw_data,temp], axis=0)

raw_data = raw_data.set_index("property_name",append=True).unstack()
raw_data.columns = raw_data.columns.droplevel(0)
final = df[['uid', 'property_name', 'timestamp']].join(raw_data)

Ps2. Apologize for the unclear DataFrame format. This is the first time I post on Stack overflow and I don't figure out how to post pretty format of DataFrame.

CodePudding user response:

You could first modify the dicts in "raw_data" to include the property names in the keys. Then set_index with the column names other than "raw_data", then explode the "raw_data" column. Then convert the resulting Series to a DataFrame, groupby first to get a single row for each property name.

df['raw_data'] = [[{f'{pn}_{k}': v for k,v in d.items()} for d in lst] for pn, lst in zip(df['property_name'], df['raw_data'])]
s = df.set_index(df.columns.drop('raw_data').tolist())['raw_data'].explode()
out = pd.DataFrame(s.tolist(), index=s.index).groupby(level=[0,1,2]).first().reset_index().fillna(np.nan)

Output:

         id property_name   timestamp  a_key1  a_key2  a_key3  a_key4  b_key5   b_key6
0  abcd1234             a  2021:12:10  value1  value2  value3  value4     NaN      NaN
1  efgh5678             b  2021:12:12     NaN     NaN     NaN     NaN  value5   value6
  • Related