Pandas Pivot Table, How to create new variables-CodePudding

I am working with pandas in python.

I have the following data frame:

ID | Time | X_mean | Y_mean | status
 1    1      0.1      0.6      0
 1    2      0.2      0.7      0
 1    3      0.3      0.8      0
 2    1      0.6      0.3      1
 2    2      0.2      0.5      1
 2    3      0.3      0.6      1
 .    .      .        .        .
 .    .      .        .        .
 .    .      .        .        .

I would like to create the following dataframe:

ID | X_mean_1 | X_mean_2 | X_mean_3 | Y_mean_1 | Y_mean_2 | Y_mean_3 | status
 1       .        .          .          .           .          .
 2       .        .          .          .           .          .

I tried to use the pivot command in various different forms, but nothing works. In stata I would just use the following command:

reshape wide X_mean Y_mean, i(ID) j(Time)

Is there a way to do the same in pandas?

Thank you for your help?

CodePudding user response：

IIUC use DataFrame.pivot with flatten MultiIndex and add status to last column:

df1 = df.pivot(index=['ID','status'], columns='Time', values=['X_mean','Y_mean'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index().assign(status = lambda x: x.pop('status'))
print (df1)
   ID  X_mean_1  X_mean_2  X_mean_3  Y_mean_1  Y_mean_2  Y_mean_3  status
0   1       0.1       0.2       0.3       0.6       0.7       0.8       0
1   2       0.6       0.2       0.3       0.3       0.5       0.6       1

Or if need aggregate duplicates per ID,status,Time use DataFrame.pivot_table:

df1 = df.pivot_table(index=['ID','status'], 
                     columns='Time', 
                     values=['X_mean','Y_mean'],
                     aggfunc='mean')
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index().assign(status = lambda x: x.pop('status'))

CodePudding user response：

Another possible solution:

(df.melt(['ID', 'Time'])
 .assign(variable = lambda x:
     [f'{y[0]}_{y[1]}' if (y[0] != 'status') else y[0] 
      for y in zip(x.variable, x.Time)])
 .pivot_table('value', 'ID', 'variable',  aggfunc='first')
 .reset_index().rename_axis(None,axis=1))

Output:

   ID  X_mean_1  X_mean_2  X_mean_3  Y_mean_1  Y_mean_2  Y_mean_3  status
0   1       0.1       0.2       0.3       0.6       0.7       0.8     0.0
1   2       0.6       0.2       0.3       0.3       0.5       0.6     1.0

In case the real dataset is well sorted as it is the minimal one here presented, then another approach could be the following, which is based on numpy.reshape:

(pd.DataFrame(
    df.drop('Time', axis=1).values
    .reshape(-1, 12)).drop([3, 4, 7, 8], axis=1)
 .set_axis(
     ['ID']   [f'{x}_{y}' for y in range(1,4) for x in df.columns[2:-1]]
       ['status'], axis=1).convert_dtypes())

Output:

   ID  X_mean_1  Y_mean_1  X_mean_2  Y_mean_2  X_mean_3  Y_mean_3  status
0   1       0.1       0.6       0.2       0.7       0.3       0.8       0
1   2       0.6       0.3       0.2       0.5       0.3       0.6       1