Extract values from pandas groupby() into a new dataset combining single values and numpy arrays

Time:07-28

I have a pandas DataFrame called df that looks like this:

name   test_type   test_number   correct
joe    0           1             1
joe    0           2             0
joe    1           1             0
joe    1           2             1
joe    0           1             1
joe    0           2             1
jim    1           1             0
jim    1           2             1
jim    0           1             0
jim    0           2             1
jim    1           1             0
jim    1           2             0
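For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the column dtypes (plain integers and strings) are an assumption, since the post shows the printed frame only.

```python
import pandas as pd

# Rebuild the example frame row by row, in the order shown in the table.
# Dtypes are assumed: strings for name, integers for the other columns.
df = pd.DataFrame({
    "name": ["joe"] * 6 + ["jim"] * 6,
    "test_type": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "test_number": [1, 2] * 6,
    "correct": [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
})
```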

I want a dataset that groups by name and extracts the mean value of correct by test_type (as a single value), as well as the mean value of correct by test_type and test_number (as a numpy array).

Here is what I need:

name    correct_0    correct_1    correct_0_by_tn    correct_1_by_tn
joe     0.75         0.5          [1, 0.5]           [0, 1]
jim     0.5          0.25         [0, 1]             [0, 0.5]

I've been using df.groupby(["name", "test_type"]).correct.mean().reset_index() and df.groupby(["name", "test_type", "test_number"]).correct.mean().reset_index(), but I can't manage to 1) extract the mean by test_number as an array and 2) organize the output into a single coherent dataframe.

Thanks in advance.

Answer:

IIUC, you can use:

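# Mean of correct per (name, test_type), with test_type pivoted to columns
A = df.groupby(['name', 'test_type'], sort=False)['correct'].mean().unstack()

# Mean of correct per (name, test_type, test_number); the per-test_number
# means of each (name, test_type) are collected into a Python list, then
# test_type is pivoted to columns
B = (df
   .groupby(['name', 'test_type', 'test_number'])['correct'].mean()
   .unstack().agg(list, axis=1).unstack()
)

# Join on name; B's columns get the '_by_tn' suffix, all get 'correct_'
out = A.join(B.add_suffix('_by_tn')).add_prefix('correct_')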

Output:

test_type  correct_0  correct_1 correct_0_by_tn correct_1_by_tn
name                                                           
joe             0.75       0.50      [1.0, 0.5]      [0.0, 1.0]
jim             0.50       0.25      [0.0, 1.0]      [0.0, 0.5]
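Note that the cells in the `_by_tn` columns above are Python lists, whereas the question asked for numpy arrays. One possible variation, sketched below, keeps each row of the intermediate table as an array instead. The name `wide` is made up for this illustration, and the sketch assumes every (name, test_type) pair has the same set of test_number values; otherwise the arrays would contain NaN.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "name": ["joe"] * 6 + ["jim"] * 6,
    "test_type": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "test_number": [1, 2] * 6,
    "correct": [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
})

# Mean of correct per (name, test_type), one column per test_type.
A = df.groupby(["name", "test_type"], sort=False)["correct"].mean().unstack()

# Mean per (name, test_type, test_number), one column per test_number.
# `wide` is a hypothetical helper name, not part of the original answer.
wide = (
    df.groupby(["name", "test_type", "test_number"])["correct"]
    .mean()
    .unstack("test_number")
)

# One NumPy array per (name, test_type) row, then one column per test_type.
B = pd.Series(list(wide.to_numpy()), index=wide.index).unstack("test_type")

out = A.join(B.add_suffix("_by_tn")).add_prefix("correct_")
```

Each cell of `out["correct_0_by_tn"]` is then a 1-D float array ordered by test_number.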

Alternative, with name as a regular column and the test_type label removed from the column axis:

out = (A
  .join(B.add_suffix('_by_tn'))
  .add_prefix('correct_')
  .rename_axis(columns=None)
  .reset_index()
)

Output:

  name  correct_0  correct_1 correct_0_by_tn correct_1_by_tn
0  joe       0.75       0.50      [1.0, 0.5]      [0.0, 1.0]
1  jim       0.50       0.25      [0.0, 1.0]      [0.0, 0.5]
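If the per-test_number values are needed for further numeric work, a list-valued column like the ones above can be turned into a regular 2-D array. The sketch below is an illustration only: it retypes the result frame by hand from the output shown, and `matrix` is a made-up name.

```python
import numpy as np
import pandas as pd

# Result frame retyped from the output above (list-valued *_by_tn columns).
out = pd.DataFrame({
    "name": ["joe", "jim"],
    "correct_0": [0.75, 0.50],
    "correct_1": [0.50, 0.25],
    "correct_0_by_tn": [[1.0, 0.5], [0.0, 1.0]],
    "correct_1_by_tn": [[0.0, 1.0], [0.0, 0.5]],
})

# Convert one list-valued column into a 2-D array:
# one row per name, one column per test_number.
matrix = np.array(out["correct_0_by_tn"].tolist())
```

This relies on every list having the same length; ragged lists would not form a rectangular array.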