Home > Software design >  DataFrame Create new column after applying a function on groupby values
DataFrame Create new column after applying a function on groupby values

Time:10-08

I have such a dataframe :

Original Dataframe

With a minimal example :

d = {'Subject': [1,1,1,1,2,2,3,3,3,3,3,3,3], 
    'Pattern': [1,1,2,2,3,3,2,2,2,2,2,2,2], 
    'Time': [0.85, 0.92, 1.03, 1.06, 0.89, 0.85, 1.20, 1.03, 1.25, 100.03, 1.97,0.23,0.64]}
df = pd.DataFrame(data=d)

Where Subject ranges from 1 to 8 and Pattern from 1 to 3. I want to create a new column where after grouping by Subject and Pattern I apply a function that removes outliers from the Time list associated to the groupby. Right now I have a solution that works well, but I was wondering if there would be a more elegant solution to it, so that I learn how to interact better with DataFrame. Taking the example, it should output :

 Subject Pattern Time Time_2
0   1      1    0.85    0.85
1   1      1    0.92    0.92
2   1      2    1.03    1.03
3   1      2    1.06    1.06
4   2      3    0.89    0.89
5   2      3    0.85    0.85
6   3      2    1.20    1.20
7   3      2    1.03    1.03
8   3      2    1.25    1.25
9   3      2    100.03  0.00 # <---
10  3      2    1.97    1.97
11  3      2    0.23    0.23
12  3      2    0.64    0.64

My current code :

def remove_outliers(arr):
    elements = np.array(arr)

    mean = np.mean(elements)
    sd = np.std(elements)
    
    return [x if (mean - 2 * sd < x <  mean   2 * sd) else 0 for x in arr]

df_g = df.groupby(['Subject', 'Pattern'])['Time']

times = []
keys = list(df_g.groups.keys())
for i, l in enumerate(df_g.apply(list)):
    times.append((keys[i], remove_outliers(l)))
    
df['Time_2'] = 0
for k, l in times:
    vals = df[(df['Subject'] == k[0]) & (df['Pattern'] == k[1])].index.values
    df['Time_2'].iloc[vals] = l

CodePudding user response:

Try this -

  1. Use groupby transform the groups to get GroupWise mean and std for each row.
  2. Next use these series objects to create your check condition as per your function.
  3. Next inverse this and use df.mask to mask values that lie outside this range, and fill them with 0 instead.
grouper = df.groupby(['Subject', 'Pattern'])['Time']

mean = grouper.transform('mean')
std = grouper.transform('std').fillna(0)

check = (df['Time'] < (mean - 2*std)) | (df['Time'] > (mean   2*std))

df['Time_new'] = df['Time'].mask(check).fillna(0)
print(df)
    Subject  Pattern    Time  Time_new
0         1        1    0.85      0.85
1         1        1    0.92      0.92
2         1        2    1.03      1.03
3         1        2    1.06      1.06
4         2        3    0.89      0.89
5         2        3    0.85      0.85
6         3        2    1.20      1.20
7         3        2    1.03      1.03
8         3        2    1.25      1.25
9         3        2  100.03      0.00   #<---
10        3        2    1.97      1.97
11        3        2    0.23      0.23
12        3        2    0.64      0.64

NOTE: Jsut to add the 3std deviation condition is too high a range for your example. Try 2std.

  • Related