How do you unnest JSON, column with a list of dict in a dataframe to a new Dataframe?-CodePudding

I have a dataframe, df_object that has a column that contains a list of dict items.

ObjectID	research
392	{'researchID': '10003', 'research.type': 'x-ray', 'research.date': '2004'}, {'researchID': '10006', 'research.type': 'document', 'research.date': '2005'
394	{'researchID': '10012', 'research.type': 'x-ray', 'research.date': '2005'}

My goal is to have a new dataframe from the research column, that includes the ObjectID, but uses the researchID as the index:

researchID (index)	ObjectID	research.type	research.date
10003	392	x-ray	2004
10006	392	document	2005
10012	394	x-ray	2005

When I use this code: df = pd.DataFrame(df_object['research'][392]) the columns are correct and it looks like this:

researchID (index)	ObjectID	research.type	research.date
10003	392	x-ray	2004
10006	392	document	2005

The issue is that I don't know how to add the rest of the objectID data on. I thought I could iterate all the rows in df_object and append the rows but wasn't sure if that's the best way because I read that you shouldn't keep appending hundreds of rows to dataframes like that because it's slow.

When I tried df = pd.DataFrame(df_object['research']), then the index stays on ObjectID and it's nested.

I tried a few flatten JSON functions from around StackOverflow, but I kept getting errors and I'm not sure if that's what I needed to be searching for.

ADDED Output from: df_object.head(2).to_dict()

{'research': {392: [{'researchID': '10003', 'research.type': ['x-ray'], 'research.date': '2004'},{'researchID': '10006', 'research.type': 'document', 'research.date': '2005'}], 393: {'researchID': '10012', 'research.type': 'x-ray', 'research.date': '2005'}}

CodePudding user response：

I'm assuming that in research column you have list of dict items (as you stated in your question). Then:

df = df.explode("research")
df = pd.concat([df, df.pop("research").apply(pd.Series)], axis=1)
print(df)

Prints:

   ObjectID researchID research.type research.date
0       392      10003         x-ray          2004
0       392      10006      document          2005
1       394      10012         x-ray          2005

To set researchID as index:

print(df.set_index("researchID"))

Prints:

            ObjectID research.type research.date
researchID                                      
10003            392         x-ray          2004
10006            392      document          2005
10012            394         x-ray          2005