Home > Software design >  Converting Keys of Pandas DF of Dictionaries to Columns and values as Data
Converting Keys of Pandas DF of Dictionaries to Columns and values as Data

Time:05-27

I have a pandas DF with 5000 rows and 400 columns containing identifiers as an index column with the remaining columns containing key value pairs in the form of a dictionary.

eg.

Identifier             0               1
identifier 1     {'key':'value'}     {'key2':'value2'}

I'm trying to get to a point where the data looks like this:

Identifier             key           key2
identifier1          value          value 2

I know we can use the following in order to create a df from a nested dictionary:

df = pd.DataFrame(data).T

But that would require the data to look like this:

data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}

When i use df.to_dict() on my current dataframe in order to get it to the above format, it looks like so:

{'Identifier1':{'0': {'key':'value'}, '1': {'key1':'value1'}...}}

Which I understand why. My question is two fold I suppose:

How can I get my data to a place where df = pd.DataFrame(data).T works ? Is there a better way to go about this?

EXAMPLE as requested:

Data looks like this:

FileID 0 1
file1 {'key1':value1} {'key3':value1}
file2 {'key2':value1} {'key1':value2}

I'd like it to look like this:

FileID key 1 key2 key3
file1 value1 value1
file2 value2 value 1

EDIT 2: There are nonetypes within the dataframe I am working with as well. In trying out the solutions myself, I've come to realize that. Apologies for not including that information initially.

CodePudding user response:

UPDATED:

Here is my understanding of the question (as updated, including in the comments, by OP):

  • Given a dataframe with:
    • leftmost column FileID and an arbitrary number of additional columns with each value in each row containing either None or a dict with 0 or more key/value pairs
  • Produce a dataframe with:
    • the same leftmost column FileID and one additional column for each unique key in the union of all keys in the dicts contained in the original dataframe
    • in each row, the dict value for the key equal to the column label which was found in the original dataframe in the row with matching FileID

Here is code to do what is asked:

df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} | 
    {k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
    for y in x:
        if y is not np.nan:
            return y
    return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID']   sorted(df2.columns[1:])]

Explanation:

  • use apply to create one dict per row which is the union of the dicts in each column as well as a key/value pair for the row's FileID
  • use concat to have the leftmost column be unchanged as FileID
  • use set_index and stack to prepare the above dataframe to be used as a sequence of dicts to initialize a new dataframe using from_records such that it will have a column for each unique key in the union of all dicts in the original dataframe
  • use agg with a custom function that for each FileID delivers the first non-NaN value for each key column, or NaN if that key is not found within the group for a given FileID
  • use reset_index to restore FileID as a column
  • (optional) sort the column labels

Full test code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'FileID':['file1','file2'], 0:[{'key1':'value1'}, {'key2':'value1'}], 1:[{'key3':'value1'}, None], 2:[{}, {'key1':'value2'}]})
print(df)

df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} | 
    {k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
    for y in x:
        if y is not np.nan:
            return y
    return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID']   sorted(df2.columns[1:])]
print(df2)

Input:

  FileID                   0                   1                   2
0  file1  {'key1': 'value1'}  {'key3': 'value1'}                  {}
1  file2  {'key2': 'value1'}                None  {'key1': 'value2'}

Output:

  FileID    key1    key2    key3
0  file1  value1     NaN  value1
1  file2  value2  value1     NaN

CodePudding user response:

A combination of melting, popitem(), and pivoting back

import pandas as pd

#example table
df = pd.DataFrame({
    'FileID': {0: 'File_0', 1: 'File_1', 2: 'File_2'},
    'a': {0: {'key_5': 'value0'}, 1: {'key_11': 'value6'}, 2: {'key_12': 'value3'}}, 
    'b': {0: {'key_8': 'value12'}, 1: {'key_9': 'value1'}, 2: {'key_11': 'value2'}}}
)
print(df)
#   FileID                     a                     b
#0  File_0   {'key_5': 'value0'}  {'key_8': 'value12'}
#1  File_1  {'key_11': 'value6'}   {'key_9': 'value1'}
#2  File_2  {'key_12': 'value3'}  {'key_11': 'value2'}

#Melt to long form and extract the key/vals's with popitem
df = df.melt(id_vars='FileID')
df[['key','value']] = df['value'].apply(lambda d: d.popitem()).to_list()

#will fail if the same key is used multiple times in the same row across different columns
df = df.pivot(index='FileID',columns='key',values='value').reset_index()
df.columns.name = ''

print(df)
#   FileID  key_11  key_12   key_5    key_8   key_9
#0  File_0     NaN     NaN  value0  value12     NaN
#1  File_1  value6     NaN     NaN      NaN  value1
#2  File_2  value2  value3     NaN      NaN     NaN
  • Related