Home > Software engineering >  Expand pandas columns with dict while keeping fragment of column name
Expand pandas columns with dict while keeping fragment of column name

Time:10-31

I have a dataset in which some columns have lookup values. There are several such columns in the dataset. I need to expand these columns so that the column name consists of the name of the column itself and the keys in the dict.

Example df:

df
col1    col2                         col3
a       '{key_1: 1a, key_2: 2a}'     '{key_3: 1a, key_4: 2a}'
b       '{key_1: 1b, key_2: 2b}'     '{key_3: 1a, key_4: 2a}'
c       '{key_1: 1c, key_2: 2c}'     '{key_3: 1a, key_4: 2a}'

Desired result:

df_res
col1     col2_key_1   col2_key_2    col3_key_3    col3_key_4
a        1a           2a            1a            2a
b        1b           2b            1a            2a
c        1c           2c            1a            2a

How can I do that?

CodePudding user response:

If in columns are dictionaries, not strings use list comprehension with json_normalize:

cols = ['col2','col3']

print (type(df['col2'].iat[0]))
<class 'dict'>

dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))

print (df)
  col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0    a         1a         2a         1a         2a
1    b         1b         2a         1a         2a
2    c         1c         2a         1a         2a

Solution with strings possible converted to dictionaries:

print (df)
  col1                            col2                            col3
0    a  {'key_1': '1a', 'key_2': '2a'}  {'key_3': '1a', 'key_4': '2a'}
1    b  {'key_1': '1b', 'key_2': '2a'}  {'key_3': '1a', 'key_4': '2a'}
2    c  {'key_1': '1c', 'key_2': '2a'}  {'key_3': '1a', 'key_4': '2a'}

cols = ['col2','col3']

    
print (type(df['col2'].iat[0]))
<class 'str'>

import ast

dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_') 
        for x in cols]
df = df.join(pd.concat(dfs, axis=1))

print (df)
  col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0    a         1a         2a         1a         2a
1    b         1b         2a         1a         2a
2    c         1c         2a         1a         2a

EDIT: Solution for original format with custom function:

print (df)
  col1                      col2                      col3
0    a  '{key_1: 1a, key_2: 2a}'  '{key_3: 1a, key_4: 2a}'
1    b  '{key_1: 1b, key_2: 2b}'  '{key_3: 1a, key_4: 2a}'
2    c  '{key_1: 1c, key_2: 2c}'  '{key_3: 1a, key_4: 2a}'

cols = ['col2','col3']

print (type(df['col2'].iat[0]))
# <class 'str'>


f = lambda x: dict([x.split(': ') for x in x.strip("{'}").split(', ')])
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_') 
        for x in cols]
df = df.join(pd.concat(dfs, axis=1))

print (df)
  col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0    a         1a         2a         1a         2a
1    b         1b         2b         1a         2a
2    c         1c         2c         1a         2a

CodePudding user response:

I would use the df[col].apply(pd.Series) method to achieve this. It would then look something like this:

def explode_dictcol(df, col):
    temp = df[col].apply(pd.Series)
    temp = temp.rename(columns={cc: col   '_'   cc for cc in temp.columns})
    return temp

df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3]

col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0    a         1a         2a         1a         2a
1    b         1b         2b         1b         2b
2    c         1c         2c         1c         2c

CodePudding user response:

If the columns are strings, the following will do the work

df_new = pd.DataFrame(data = [
    [row['col1'],
    row['col2'].split(':')[1].split(',')[0].strip(),
    row['col2'].split(':')[2].split('}')[0].strip(),
    row['col3'].split(':')[1].split(',')[0].strip(),
    row['col3'].split(':')[2].split('}')[0].strip()]
    for index, row in df.iterrows()
    ]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})

[Out]:

  col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0    a         1a         2a         1a         2a
1    b         1b         2b         1a         2a
2    c         1c         2c         1a         2a

Notes:

  • Check the data type with

    print(type(df['col2'][0]))
    
    # or
    
    print(type(df['col2'].iat[0]))
    
  • The first part of the proposed solution

    df_new = pd.DataFrame(data = [
        [row['col1'], row['col2'].split(':')[1].split(',')[0].strip(),
        row['col2'].split(':')[2].split('}')[0].strip(),
        row['col3'].split(':')[1].split(',')[0].strip(),
        row['col3'].split(':')[2].split('}')[0].strip()]
        for index, row in df.iterrows()
        ])
    

    gives the following output

       0   1   2   3   4
    0  a  1a  2a  1a  2a
    1  b  1b  2b  1a  2a
    2  c  1c  2c  1a  2a
    

    which is almost the same, but that is why one has to pass .rename() to make sure the column names are as OP wants.

  • Related