I have this project where I pull a json from gitlab, filter its contents, and then spit out a csv report. I've been going in circles putting in different solutions, and I did get it to work on my personal PC using two different methods, but neither work in the gitlab environment.
First issue: The dataset I'm pulling is rather peculiar in that its 2d:
{'id':0, 'entity_type':'Project', 'details':{'author_name':'Billy', 'author_id':02}, 'created_at':10242022}
which translates to:
id | entity_type | details | created_at |
---|---|---|---|
0 | Project | {'author_name':'Billy', 'author_id':02} | 10-24-2022 |
What I've been trying to do is flatten it into this:
id | entity_type | author_name | author_id | created_at |
---|---|---|---|---|
0 | Project | 'Billy' | 02 | 10-24-2022 |
What I have so far is the following, which works on my personal PC, but for some reason not the gitlab:
import pandas as pd
#pull data from the gitlab
df = pd.DataFrame(data)
sub_df = df.details.apply(pd.Series) #pull out details column
sub_df = pd.concat({'details':sub_df}, axis=1, names=['l1','l2']) #
df = pd.concat({'':df}, axis=1, names =['l1','l2'])
df = pd.concat((df, sub_df), axis=1)
#code to delete unnecessary data from df.
Now, on my PC when I run this code, pd.Series helps turn the 'details' column into multiple columns of the headers within. However, when I run it on the GitLab, it instead separates into multiple columns of 'details' with each value stored underneath, ie:
details | details | details | details | ... |
---|---|---|---|---|
author_name | Billy | author_id | 02 | ... |
author_name | Sam | author_id | 05 | ... |
CodePudding user response:
Try with json_normalize
instead of apply(pd.Series)
df = pd.concat([df, pd.json_normalize(df.pop('details'))], axis=1)
CodePudding user response:
I assume you have a list of dictionaries as data. I would suggest that you flatten the dictionaries (in the list of dictionaries) before converting it into a DataFrame
.
Try this code:
import pandas as pd
# Assuming this is how your data looks:
data = [
{
'id':0,
'entity_type':'Project',
'details':{'author_name':'Billy','author_id':'02'},
'created_at':10242022
},
{
'id':1,
'entity_type':'Project',
'details':{'author_name':'Sam','author_id':'05'},
'created_at':10242022
}
]
flattened_data = [dict(d, **d.pop("details")) for d in data]
df = pd.DataFrame(flattened_data)
df.head()
which results in this:
id | entity_type | author_name | author_id | created_at | |
---|---|---|---|---|---|
0 | 0 | Project | Billy | 02 | 10242022 |
1 | 1 | Project | Sam | 05 | 10242022 |