I extracted data via the GitHub API and then used json_normalize to flatten it into a dataframe. Unfortunately, some of the data is still held in nested dictionaries within a column. I can extract the value from a single dictionary, but the problem arises when a cell contains more than one dictionary.
How do I reshape the dataframe so that it grows to account for the additional values?
Like this:
CodePudding user response:
To reproduce your problem, let's suppose we have this dataframe:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'Pull.Request.Files.Nodes': [[{'path': 'example 1'}], [{'path': 'example 2'}, {'path': 'example 3'}]],
                   })
df
   ID                        Pull.Request.Files.Nodes
0   1                         [{'path': 'example 1'}]
1   2  [{'path': 'example 2'}, {'path': 'example 3'}]
We can explode the column 'Pull.Request.Files.Nodes' so that each dictionary in a list gets its own row, and then apply a lambda function to pull the 'path' value out of each dictionary, like this:
df = df.explode('Pull.Request.Files.Nodes', ignore_index=True)
df['Pull.Request.Files.Nodes'] = df['Pull.Request.Files.Nodes'].apply(lambda r:r['path'])
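One caveat with the lambda above: if a cell holds an empty list, explode leaves NaN in that row, and r['path'] would then raise a TypeError. Here is a minimal sketch of a more defensive variant, assuming that missing entries should simply become None. The third row with an empty list and the helper name extract_path are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Pull.Request.Files.Nodes': [
        [{'path': 'example 1'}],
        [{'path': 'example 2'}, {'path': 'example 3'}],
        [],  # hypothetical pull request with no files
    ],
})

def extract_path(cell):
    # Hypothetical helper: return the 'path' value for dictionaries,
    # and None for anything else (e.g. the NaN that explode produces
    # for an empty list).
    return cell.get('path') if isinstance(cell, dict) else None

exploded = df.explode('Pull.Request.Files.Nodes', ignore_index=True)
exploded['Pull.Request.Files.Nodes'] = (
    exploded['Pull.Request.Files.Nodes'].apply(extract_path)
)
```

Whether None is the right placeholder depends on what you do with the column afterwards; dropping those rows with dropna would be another reasonable choice.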
Complete code
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'Pull.Request.Files.Nodes': [[{'path': 'example 1'}], [{'path': 'example 2'}, {'path': 'example 3'}]],
                   })
df = df.explode('Pull.Request.Files.Nodes', ignore_index=True)
df['Pull.Request.Files.Nodes'] = df['Pull.Request.Files.Nodes'].apply(lambda r: r['path'])
#    ID Pull.Request.Files.Nodes
# 0   1                example 1
# 1   2                example 2
# 2   2                example 3
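Since the data was originally flattened with json_normalize, another option may be to expand the nested list at that stage through its record_path and meta parameters, which produces one row per dictionary directly. This is only a sketch: the shape of the raw records below (the keys 'ID', 'files' and 'nodes') is an assumption for illustration, not the actual GitHub API response.

```python
import pandas as pd

# Hypothetical raw records; the real GitHub API payload will differ.
records = [
    {'ID': 1, 'files': {'nodes': [{'path': 'example 1'}]}},
    {'ID': 2, 'files': {'nodes': [{'path': 'example 2'},
                                  {'path': 'example 3'}]}},
]

# record_path walks down to the list of dictionaries to expand into rows;
# meta names the parent-level fields to repeat on each of those rows.
flat = pd.json_normalize(records, record_path=['files', 'nodes'], meta=['ID'])
```

Note that a record whose list is empty contributes no rows at all with this approach, so it behaves differently from explode in that case.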