I am working with pandas in python.
I have the following data frame:
ID | Time | X_mean | Y_mean | status
1 1 0.1 0.6 0
1 2 0.2 0.7 0
1 3 0.3 0.8 0
2 1 0.6 0.3 1
2 2 0.2 0.5 1
2 3 0.3 0.6 1
. . . . .
. . . . .
. . . . .
I would like to create the following dataframe:
ID | X_mean_1 | X_mean_2 | X_mean_3 | Y_mean_1 | Y_mean_2 | Y_mean_3 | status
1 . . . . . .
2 . . . . . .
I tried to use the pivot command in various different forms, but nothing works. In stata I would just use the following command:
reshape wide X_mean Y_mean, i(ID) j(Time)
Is there a way to do the same in pandas?
Thank you for your help?
CodePudding user response:
IIUC use DataFrame.pivot
with flatten MultiIndex
and add status
to last column:
df1 = df.pivot(index=['ID','status'], columns='Time', values=['X_mean','Y_mean'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index().assign(status = lambda x: x.pop('status'))
print (df1)
ID X_mean_1 X_mean_2 X_mean_3 Y_mean_1 Y_mean_2 Y_mean_3 status
0 1 0.1 0.2 0.3 0.6 0.7 0.8 0
1 2 0.6 0.2 0.3 0.3 0.5 0.6 1
Or if need aggregate duplicates per ID,status,Time
use DataFrame.pivot_table
:
df1 = df.pivot_table(index=['ID','status'],
columns='Time',
values=['X_mean','Y_mean'],
aggfunc='mean')
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index().assign(status = lambda x: x.pop('status'))
CodePudding user response:
Another possible solution:
(df.melt(['ID', 'Time'])
.assign(variable = lambda x:
[f'{y[0]}_{y[1]}' if (y[0] != 'status') else y[0]
for y in zip(x.variable, x.Time)])
.pivot_table('value', 'ID', 'variable', aggfunc='first')
.reset_index().rename_axis(None,axis=1))
Output:
ID X_mean_1 X_mean_2 X_mean_3 Y_mean_1 Y_mean_2 Y_mean_3 status
0 1 0.1 0.2 0.3 0.6 0.7 0.8 0.0
1 2 0.6 0.2 0.3 0.3 0.5 0.6 1.0
In case the real dataset is well sorted as it is the minimal one here presented, then another approach could be the following, which is based on numpy.reshape
:
(pd.DataFrame(
df.drop('Time', axis=1).values
.reshape(-1, 12)).drop([3, 4, 7, 8], axis=1)
.set_axis(
['ID'] [f'{x}_{y}' for y in range(1,4) for x in df.columns[2:-1]]
['status'], axis=1).convert_dtypes())
Output:
ID X_mean_1 Y_mean_1 X_mean_2 Y_mean_2 X_mean_3 Y_mean_3 status
0 1 0.1 0.6 0.2 0.7 0.3 0.8 0
1 2 0.6 0.3 0.2 0.5 0.3 0.6 1