I have a pandas DF with 5000 rows and 400 columns containing identifiers as an index column with the remaining columns containing key value pairs in the form of a dictionary.
eg.
Identifier 0 1
identifier 1 {'key':'value'} {'key2':'value2'}
I'm trying to get to a point where the data looks like this:
Identifier key key2
identifier1 value value 2
I know we can use the following in order to create a df from a nested dictionary:
df = pd.DataFrame(data).T
But that would require the data to look like this:
data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}
When i use df.to_dict() on my current dataframe in order to get it to the above format, it looks like so:
{'Identifier1':{'0': {'key':'value'}, '1': {'key1':'value1'}...}}
Which I understand why. My question is two fold I suppose:
How can I get my data to a place where df = pd.DataFrame(data).T works ? Is there a better way to go about this?
EXAMPLE as requested:
Data looks like this:
FileID | 0 | 1 |
---|---|---|
file1 | {'key1':value1} | {'key3':value1} |
file2 | {'key2':value1} | {'key1':value2} |
I'd like it to look like this:
FileID | key 1 | key2 | key3 |
---|---|---|---|
file1 | value1 | value1 | |
file2 | value2 | value 1 |
EDIT 2: There are nonetypes within the dataframe I am working with as well. In trying out the solutions myself, I've come to realize that. Apologies for not including that information initially.
CodePudding user response:
UPDATED:
Here is my understanding of the question (as updated, including in the comments, by OP):
- Given a dataframe with:
- leftmost column
FileID
and an arbitrary number of additional columns with each value in each row containing either None or a dict with 0 or more key/value pairs
- leftmost column
- Produce a dataframe with:
- the same leftmost column
FileID
and one additional column for each unique key in the union of all keys in the dicts contained in the original dataframe - in each row, the dict value for the key equal to the column label which was found in the original dataframe in the row with matching
FileID
- the same leftmost column
Here is code to do what is asked:
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] sorted(df2.columns[1:])]
Explanation:
- use
apply
to create one dict per row which is the union of the dicts in each column as well as a key/value pair for the row'sFileID
- use
concat
to have the leftmost column be unchanged asFileID
- use
set_index
andstack
to prepare the above dataframe to be used as a sequence of dicts to initialize a new dataframe usingfrom_records
such that it will have a column for each unique key in the union of all dicts in the original dataframe - use
agg
with a custom function that for eachFileID
delivers the first non-NaN value for each key column, or NaN if that key is not found within the group for a givenFileID
- use
reset_index
to restoreFileID
as a column - (optional) sort the column labels
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'FileID':['file1','file2'], 0:[{'key1':'value1'}, {'key2':'value1'}], 1:[{'key3':'value1'}, None], 2:[{}, {'key1':'value2'}]})
print(df)
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] sorted(df2.columns[1:])]
print(df2)
Input:
FileID 0 1 2
0 file1 {'key1': 'value1'} {'key3': 'value1'} {}
1 file2 {'key2': 'value1'} None {'key1': 'value2'}
Output:
FileID key1 key2 key3
0 file1 value1 NaN value1
1 file2 value2 value1 NaN
CodePudding user response:
A combination of melting, popitem(), and pivoting back
import pandas as pd
#example table
df = pd.DataFrame({
'FileID': {0: 'File_0', 1: 'File_1', 2: 'File_2'},
'a': {0: {'key_5': 'value0'}, 1: {'key_11': 'value6'}, 2: {'key_12': 'value3'}},
'b': {0: {'key_8': 'value12'}, 1: {'key_9': 'value1'}, 2: {'key_11': 'value2'}}}
)
print(df)
# FileID a b
#0 File_0 {'key_5': 'value0'} {'key_8': 'value12'}
#1 File_1 {'key_11': 'value6'} {'key_9': 'value1'}
#2 File_2 {'key_12': 'value3'} {'key_11': 'value2'}
#Melt to long form and extract the key/vals's with popitem
df = df.melt(id_vars='FileID')
df[['key','value']] = df['value'].apply(lambda d: d.popitem()).to_list()
#will fail if the same key is used multiple times in the same row across different columns
df = df.pivot(index='FileID',columns='key',values='value').reset_index()
df.columns.name = ''
print(df)
# FileID key_11 key_12 key_5 key_8 key_9
#0 File_0 NaN NaN value0 value12 NaN
#1 File_1 value6 NaN NaN NaN value1
#2 File_2 value2 value3 NaN NaN NaN