Home > Software engineering >  Creating a new column based on the mean of other values in group
Creating a new column based on the mean of other values in group

Time:04-30

I am trying to calculate the mean of other values by excluding the focal company. I know it is a little bit complicated but let me explain:

Such as, assume the below code is my data:

d = {'col1': ["A", "A", "A", "B", "B", "B", "c", "c","c", "d", "d", "d", "e", "e", "e"], 
     'col2': [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017], 
     'col3': [10, 20, 25, 10, 12, 14, 8, 9, 10, 50, 60, 70, 40, 50, 60],
     'group':[10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20,20]} 
df = pd.DataFrame(d)

I would like to get (B C)'s mean for 2015 and add it into a new column in A.2016, by considering df.group. So, we need to take mean by df.group for the previous year, by excluding the focal item.

The result should correspond to this:

d = {'col1': ["A", "A", "A", "B", "B", "B", "c", "c", "c", "d", "d", "d", "e", "e", "e"], 
     'col2': [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017], 
     'col3': [10, 20, 25, 10, 12, 14, 8, 9, 10, 50, 60, 70, 40, 50, 60],
     'group':[10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20,20],  
     'operation':['0', '(B2015 C2015)/2', '(B2016 C2016)/2', '0', '(A2015 C2015)/2', '(A2016 C2016)/2', '0', '(A2015 B2015)/2', '(A2016 B2016)/2',"0", "E2015", "E2016", "0","D2015", "D2016" ], 
     'mean': [nan, 9, 10.5, nan, 9, 14.5, nan, 10, 16, nan, 40, 50, nan, 50, 60]}
output = pd.DataFrame(d)

>>> output 
   col1  col2  col3  group        operation  mean
0     A  2015    10     10              nan   0.0
1     A  2016    20     10  (B2015 C2015)/2   9.0
2     A  2017    25     10  (B2016 C2016)/2  10.5
3     B  2015    10     10                0   0.0
4     B  2016    12     10  (A2015 C2015)/2   9.0
5     B  2017    14     10  (A2016 C2016)/2  14.5
6     c  2015     8     10                0   0.0
7     c  2016     9     10  (A2015 B2015)/2  10.0
8     c  2017    10     10  (A2016 B2016)/2  16.0
9     d  2015    50     20                0   0.0
10    d  2016    60     20            E2015  40.0
11    d  2017    70     20            E2016  50.0
12    e  2015    40     20                0   0.0
13    e  2016    50     20            D2015  50.0
14    e  2017    60     20            D2016  60.0

CodePudding user response:

  1. Compute the means of all other values within each group using a double groupby:
  • sum all the values within the group
  • subtract the current (focal) value
  • divide by one less than the number of items in the group
  1. Assign the shift-ed means to a new column:
means = df.groupby("group").apply(lambda x: x.groupby("col2")["col3"].transform("sum").sub(x["col3"]).div(len(x["col1"].unique())-1)).droplevel(0)

df["mean"] = means.shift().where(df["col1"].eq(df["col1"].shift()),0)

>>> df
   col1  col2  col3  group  mean
0     A  2015    10     10   0.0
1     A  2016    20     10   9.0
2     A  2017    25     10  10.5
3     B  2015    10     10   0.0
4     B  2016    12     10   9.0
5     B  2017    14     10  14.5
6     c  2015     8     10   0.0
7     c  2016     9     10  10.0
8     c  2017    10     10  16.0
9     d  2015    50     20   0.0
10    d  2016    60     20  40.0
11    d  2017    70     20  50.0
12    e  2015    40     20   0.0
13    e  2016    50     20  50.0
14    e  2017    60     20  60.0
  • Related