Create multiple new DataFrame columns from a nested column in the same DataFrame

I have a data frame that looks like this:

import pandas as pd

a = {'price': [1, 2],
     'nested_column': [
         [{'key': 'code', 'value': 'A', 'label': 'rif1'},
          {'key': 'datemod', 'value': '31/09/2022', 'label': 'mod'}],
         [{'key': 'code', 'value': 'B', 'label': 'rif2'},
          {'key': 'datemod', 'value': '31/08/2022', 'label': 'mod'}]]}

df = pd.DataFrame(data=a)

My expected output should look like this:

b = {'price': [1, 2],
    'code':["A","B"],
    'datemod':["31/09/2022","31/08/2022"]}

exp_df = pd.DataFrame(data=b)

I tried the following, but it doesn't produce the expected result:

df = pd.concat([df.drop(['nested_column'], axis=1), df['nested_column'].apply(pd.Series)], axis=1)
df = pd.concat([df.drop([0], axis=1), df[0].apply(pd.Series)], axis=1)

CodePudding user response：

You can pop and explode the column, feed the result to json_normalize, then pivot on the key/value pairs and join back:

# pop the json column and explode to rows
s = df.pop('nested_column').explode()

df = df.join(pd.json_normalize(s)    # normalize dictionary to columns
               .assign(idx=s.index)  # ensure same index
               .pivot(index='idx', columns='key', values='value')
             )

output:

   price code     datemod
0      1    A  31/09/2022
1      2    B  31/08/2022
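
To make the chain above easier to follow, here is a sketch that splits it into named intermediate steps on the sample data from the question. The names flat, wide and result are only illustrative, and s.tolist() is passed to json_normalize so that it receives a plain list of dictionaries:

```python
import pandas as pd

a = {'price': [1, 2],
     'nested_column': [
         [{'key': 'code', 'value': 'A', 'label': 'rif1'},
          {'key': 'datemod', 'value': '31/09/2022', 'label': 'mod'}],
         [{'key': 'code', 'value': 'B', 'label': 'rif2'},
          {'key': 'datemod', 'value': '31/08/2022', 'label': 'mod'}]]}
df = pd.DataFrame(a)

# Step 1: one row per dictionary; the original row label is repeated
s = df.pop('nested_column').explode()

# Step 2: one column per dictionary field ('key', 'value', 'label')
flat = pd.json_normalize(s.tolist())

# Step 3: record which original row each dictionary came from
flat['idx'] = s.index.to_numpy()

# Step 4: reshape long to wide, one column per distinct 'key'
wide = flat.pivot(index='idx', columns='key', values='value')

# Step 5: align on the row labels and attach the new columns
result = df.join(wide)
print(result)
```

The 'label' field is simply dropped by the pivot, since only 'key' and 'value' are used.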

CodePudding user response：

Build a {key: value} dictionary from each row's list of dictionaries, then flatten it into columns with json_normalize:

# map each list of dictionaries to a single {key: value} dictionary
f = lambda x: {y['key']: y['value'] for y in x}
df['nested_column'] = df['nested_column'].apply(f)
print(df)
   price                           nested_column
0      1  {'code': 'A', 'datemod': '31/09/2022'}
1      2  {'code': 'B', 'datemod': '31/08/2022'}

df1 = df.join(pd.json_normalize(df.pop('nested_column')))
print(df1)
   price code     datemod
0      1    A  31/09/2022
1      2    B  31/08/2022
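
As a quick check that this kind of result matches the expected frame from the question, something like the following could be used. It is only a sketch: to_pairs is an illustrative helper name, and pandas.testing.assert_frame_equal raises an AssertionError if the two frames differ.

```python
import pandas as pd

a = {'price': [1, 2],
     'nested_column': [
         [{'key': 'code', 'value': 'A', 'label': 'rif1'},
          {'key': 'datemod', 'value': '31/09/2022', 'label': 'mod'}],
         [{'key': 'code', 'value': 'B', 'label': 'rif2'},
          {'key': 'datemod', 'value': '31/08/2022', 'label': 'mod'}]]}
b = {'price': [1, 2],
     'code': ['A', 'B'],
     'datemod': ['31/09/2022', '31/08/2022']}

df = pd.DataFrame(a)
exp_df = pd.DataFrame(b)


def to_pairs(items):
    """Collapse a list of {'key': ..., 'value': ...} dicts into one dict."""
    return {d['key']: d['value'] for d in items}


# one flat dictionary per row, expanded into columns
flat = pd.json_normalize(df.pop('nested_column').apply(to_pairs).tolist())
df1 = df.join(flat)

# raises AssertionError if values, columns or dtypes differ
pd.testing.assert_frame_equal(df1, exp_df)
```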

CodePudding user response：

A more Pythonic approach: build dictionary b directly from a. For each target key, collect the values of the entries whose 'key' matches it.

n = len(a['nested_column'])
m = len(a['nested_column'][0])

b = {}
b['price'] = a['price']
for var in ['code', 'datemod']:
    # keep the value of every entry whose 'key' matches var, row by row
    b[var] = [a['nested_column'][i][j]['value']
              for i in range(n) for j in range(m)
              if a['nested_column'][i][j]['key'] == var]
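
The loop above hard-codes the key names and assumes every row has as many entries as the first one. If that cannot be guaranteed, a variant along these lines might be safer. This is a sketch, not part of the original answer: the keys are discovered from the data, and a row that lacks a key gets None (an assumption about the desired fill value).

```python
a = {'price': [1, 2],
     'nested_column': [
         [{'key': 'code', 'value': 'A', 'label': 'rif1'},
          {'key': 'datemod', 'value': '31/09/2022', 'label': 'mod'}],
         [{'key': 'code', 'value': 'B', 'label': 'rif2'},
          {'key': 'datemod', 'value': '31/08/2022', 'label': 'mod'}]]}

rows = a['nested_column']

# collect every distinct key, in first-seen order
keys = []
for row in rows:
    for item in row:
        if item['key'] not in keys:
            keys.append(item['key'])

b = {'price': a['price']}
for key in keys:
    column = []
    for row in rows:
        lookup = {item['key']: item['value'] for item in row}
        column.append(lookup.get(key))  # None when the row lacks this key
    b[key] = column

print(b)
```

The resulting dictionary b can be passed to pd.DataFrame in the same way as before.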

CodePudding user response：

I prefer doing operations like this outside of pandas, primarily for speed, although @mozway's solution is certainly pleasing to the eye :)

Export df to a list of records (one dictionary per row):

mapping = df.to_dict('records')

Iterate through the records to build a defaultdict of lists:

from collections import defaultdict

out = defaultdict(list)

for entry in mapping:
    for key, value in entry.items():
        if key == 'price':
            out[key].append(value)
        else:
            for ent in value:
                if ent['key'] == "code":
                    out["code"].append(ent["value"])
                else:
                    out["datemod"].append(ent["value"])


pd.DataFrame(out)

   price code     datemod
0      1    A  31/09/2022
1      2    B  31/08/2022

You could skip the records conversion by iterating over a directly (or over df.to_dict('list')):

from collections import defaultdict
from itertools import chain

out = defaultdict(list)
for key, value in a.items():
    if key == "price":
        out[key].extend(value)
    else:
        value = chain.from_iterable(value)
        for ent in value:
            if ent['key'] == 'code':
                out['code'].append(ent['value'])
            else:
                out['datemod'].append(ent['value'])


pd.DataFrame(out)

   price code     datemod
0      1    A  31/09/2022
1      2    B  31/08/2022