I want to split a column into multiple column based on a grouped value. For example
# input
df = pd.DataFrame([[1,2,1], [1,4,4], [1,5,7], [2,1,1], [2,3,5], [2,3,1]], columns=['cat', 'v1', 'v2'])
#output
df_out = pd.DataFrame([[1,2,0,1,0],
[1,4,0,4,0],
[1,5,0,7,0],
[2,0,1,0,1],
[2,0,3,0,5],
[2,0,3,0,1]], columns=['cat', 'v1_1', 'v1_2', 'v2_1', 'v2_2'])
I came up with this code, want to get an efficient one since my input file is huge.
grouped = df.groupby(['cat'])
cols = ['v1', 'v2']
for column in cols:
new_name = [column '_' str(k) for k in grouped.groups]
df[new_name] = np.zeros((df[column].size, len(grouped.groups)), dtype=df[column].dtype)
for k,index in grouped.groups.items():
df.loc[index, [c '_' str(k) for c in cols]] = df.loc[index, cols].to_numpy()
In:
cat v1 v2
0 1 2 1
1 1 4 4
2 1 5 7
3 2 1 1
4 2 3 5
5 2 3 1
out:
cat v1_1 v1_2 v2_1 v2_2
0 1 2 0 1 0
1 1 4 0 4 0
2 1 5 0 7 0
3 2 0 1 0 1
4 2 0 3 0 5
5 2 0 3 0 1
CodePudding user response:
Use DataFrame.set_index
with DataFrame.unstack
, flatten MultiIndex
and last add cat
column by DataFrame.join
:
df1 = df.set_index('cat', append=True).unstack(fill_value=0)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df[['cat']].join(df1)
print (df1)
cat v1_1 v1_2 v2_1 v2_2
0 1 2 0 1 0
1 1 4 0 4 0
2 1 5 0 7 0
3 2 0 1 0 1
4 2 0 3 0 5
5 2 0 3 0 1