I have a dataframe, df_object
that has a column that contains a list of dict items.
ObjectID | research |
---|---|
392 | {'researchID': '10003', 'research.type': 'x-ray', 'research.date': '2004'}, {'researchID': '10006', 'research.type': 'document', 'research.date': '2005' |
394 | {'researchID': '10012', 'research.type': 'x-ray', 'research.date': '2005'} |
My goal is to have a new dataframe from the research column, that includes the ObjectID, but uses the researchID as the index:
researchID (index) | ObjectID | research.type | research.date |
---|---|---|---|
10003 | 392 | x-ray | 2004 |
10006 | 392 | document | 2005 |
10012 | 394 | x-ray | 2005 |
When I use this code: df = pd.DataFrame(df_object['research'][392])
the columns are correct and it looks like this:
researchID (index) | ObjectID | research.type | research.date |
---|---|---|---|
10003 | 392 | x-ray | 2004 |
10006 | 392 | document | 2005 |
The issue is that I don't know how to add the rest of the objectID data on. I thought I could iterate all the rows in df_object and append the rows but wasn't sure if that's the best way because I read that you shouldn't keep appending hundreds of rows to dataframes like that because it's slow.
When I tried df = pd.DataFrame(df_object['research'])
, then the index stays on ObjectID and it's nested.
I tried a few flatten JSON functions from around StackOverflow, but I kept getting errors and I'm not sure if that's what I needed to be searching for.
ADDED
Output from: df_object.head(2).to_dict()
{'research': {392: [{'researchID': '10003', 'research.type': ['x-ray'], 'research.date': '2004'},{'researchID': '10006', 'research.type': 'document', 'research.date': '2005'}], 393: {'researchID': '10012', 'research.type': 'x-ray', 'research.date': '2005'}}
CodePudding user response:
I'm assuming that in research
column you have list of dict items (as you stated in your question). Then:
df = df.explode("research")
df = pd.concat([df, df.pop("research").apply(pd.Series)], axis=1)
print(df)
Prints:
ObjectID researchID research.type research.date
0 392 10003 x-ray 2004
0 392 10006 document 2005
1 394 10012 x-ray 2005
To set researchID
as index:
print(df.set_index("researchID"))
Prints:
ObjectID research.type research.date
researchID
10003 392 x-ray 2004
10006 392 document 2005
10012 394 x-ray 2005