I have such a dataframe :
With a minimal example :
d = {'Subject': [1,1,1,1,2,2,3,3,3,3,3,3,3],
'Pattern': [1,1,2,2,3,3,2,2,2,2,2,2,2],
'Time': [0.85, 0.92, 1.03, 1.06, 0.89, 0.85, 1.20, 1.03, 1.25, 100.03, 1.97,0.23,0.64]}
df = pd.DataFrame(data=d)
Where Subject
ranges from 1 to 8 and Pattern
from 1 to 3. I want to create a new column where after grouping by Subject
and Pattern
I apply a function that removes outliers from the Time
list associated to the groupby. Right now I have a solution that works well, but I was wondering if there would be a more elegant solution to it, so that I learn how to interact better with DataFrame. Taking the example, it should output :
Subject Pattern Time Time_2
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 # <---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
My current code :
def remove_outliers(arr):
elements = np.array(arr)
mean = np.mean(elements)
sd = np.std(elements)
return [x if (mean - 2 * sd < x < mean 2 * sd) else 0 for x in arr]
df_g = df.groupby(['Subject', 'Pattern'])['Time']
times = []
keys = list(df_g.groups.keys())
for i, l in enumerate(df_g.apply(list)):
times.append((keys[i], remove_outliers(l)))
df['Time_2'] = 0
for k, l in times:
vals = df[(df['Subject'] == k[0]) & (df['Pattern'] == k[1])].index.values
df['Time_2'].iloc[vals] = l
CodePudding user response:
Try this -
- Use groupby
transform
the groups to get GroupWise mean and std for each row. - Next use these series objects to create your
check
condition as per your function. - Next inverse this and use
df.mask
to mask values that lie outside this range, and fill them with 0 instead.
grouper = df.groupby(['Subject', 'Pattern'])['Time']
mean = grouper.transform('mean')
std = grouper.transform('std').fillna(0)
check = (df['Time'] < (mean - 2*std)) | (df['Time'] > (mean 2*std))
df['Time_new'] = df['Time'].mask(check).fillna(0)
print(df)
Subject Pattern Time Time_new
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 #<---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
NOTE: Jsut to add the 3std deviation condition is too high a range for your example. Try 2std.