I have a dataset in which some columns have lookup values. There are several such columns in the dataset. I need to expand these columns so that the column name consists of the name of the column itself and the keys in the dict.
Example df:
df
col1 col2 col3
a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
Desired result:
df_res
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
a 1a 2a 1a 2a
b 1b 2b 1a 2a
c 1c 2c 1a 2a
How can I do that?
CodePudding user response:
If in columns are dictionaries, not strings use list comprehension with json_normalize
:
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'dict'>
dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
Solution with strings possible converted to dictionaries:
print (df)
col1 col2 col3
0 a {'key_1': '1a', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
1 b {'key_1': '1b', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
2 c {'key_1': '1c', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'str'>
import ast
dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
EDIT: Solution for original format with custom function:
print (df)
col1 col2 col3
0 a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
1 b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
2 c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
# <class 'str'>
f = lambda x: dict([x.split(': ') for x in x.strip("{'}").split(', ')])
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
CodePudding user response:
I would use the df[col].apply(pd.Series)
method to achieve this. It would then look something like this:
def explode_dictcol(df, col):
temp = df[col].apply(pd.Series)
temp = temp.rename(columns={cc: col '_' cc for cc in temp.columns})
return temp
df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3]
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1b 2b
2 c 1c 2c 1c 2c
CodePudding user response:
If the columns are strings, the following will do the work
df_new = pd.DataFrame(data = [
[row['col1'],
row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})
[Out]:
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Notes:
Check the data type with
print(type(df['col2'][0])) # or print(type(df['col2'].iat[0]))
The first part of the proposed solution
df_new = pd.DataFrame(data = [ [row['col1'], row['col2'].split(':')[1].split(',')[0].strip(), row['col2'].split(':')[2].split('}')[0].strip(), row['col3'].split(':')[1].split(',')[0].strip(), row['col3'].split(':')[2].split('}')[0].strip()] for index, row in df.iterrows() ])
gives the following output
0 1 2 3 4 0 a 1a 2a 1a 2a 1 b 1b 2b 1a 2a 2 c 1c 2c 1a 2a
which is almost the same, but that is why one has to pass
.rename()
to make sure the column names are as OP wants.