Home > Back-end >  Flatten a Dataframe within a Dataframe
Flatten a Dataframe within a Dataframe

Time:07-28

I have this project where I pull a json from gitlab, filter its contents, and then spit out a csv report. I've been going in circles putting in different solutions, and I did get it to work on my personal PC using two different methods, but neither work in the gitlab environment.

First issue: The dataset I'm pulling is rather peculiar in that its 2d:

{'id':0, 'entity_type':'Project', 'details':{'author_name':'Billy', 'author_id':02}, 'created_at':10242022}

which translates to:

id entity_type details created_at
0 Project {'author_name':'Billy', 'author_id':02} 10-24-2022

What I've been trying to do is flatten it into this:

id entity_type author_name author_id created_at
0 Project 'Billy' 02 10-24-2022

What I have so far is the following, which works on my personal PC, but for some reason not the gitlab:

import pandas as pd

#pull data from the gitlab
df = pd.DataFrame(data)
sub_df = df.details.apply(pd.Series)    #pull out details column
sub_df = pd.concat({'details':sub_df}, axis=1, names=['l1','l2'])    #
df = pd.concat({'':df}, axis=1, names =['l1','l2'])
df = pd.concat((df, sub_df), axis=1)

#code to delete unnecessary data from df.

Now, on my PC when I run this code, pd.Series helps turn the 'details' column into multiple columns of the headers within. However, when I run it on the GitLab, it instead separates into multiple columns of 'details' with each value stored underneath, ie:

details details details details ...
author_name Billy author_id 02 ...
author_name Sam author_id 05 ...

CodePudding user response:

Try with json_normalize instead of apply(pd.Series)

df = pd.concat([df, pd.json_normalize(df.pop('details'))], axis=1)

CodePudding user response:

I assume you have a list of dictionaries as data. I would suggest that you flatten the dictionaries (in the list of dictionaries) before converting it into a DataFrame.

Try this code:

import pandas as pd

# Assuming this is how your data looks:
data = [
    {
        'id':0,
        'entity_type':'Project',
        'details':{'author_name':'Billy','author_id':'02'},
        'created_at':10242022
    },
    {
        'id':1,
        'entity_type':'Project',
        'details':{'author_name':'Sam','author_id':'05'},
        'created_at':10242022
    }
]

flattened_data = [dict(d, **d.pop("details")) for d in data]

df = pd.DataFrame(flattened_data)

df.head()

which results in this:

id entity_type author_name author_id created_at
0 0 Project Billy 02 10242022
1 1 Project Sam 05 10242022
  • Related