I've a pandas dataframe of two variables( Begin and End) for three replicates(R1, R2, R3) each of Control(C) and Treatment(T)
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to pick up those rows only for which all three replicates of both control and treatment totally six were observed, i.e (Begin,End:2,5) and not (Begin,End:4,7) as it has only five observations missing the C_R1.
I've gone through some posts here and tried the following, which works for a small set of sample but I've to test with real data which has around 50K rows
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if any better technique exists. Thanks
CodePudding user response:
df1
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
index | Begin | End | Expt |
---|---|---|---|
0 | 2 | 5 | C_R1 |
1 | 2 | 5 | C_R2 |
2 | 2 | 5 | C_R3 |
3 | 2 | 5 | T_R1 |
4 | 2 | 5 | T_R2 |
5 | 2 | 5 | T_R3 |
CodePudding user response:
df[df.groupby(['Begin', 'End'])['Expt']
.transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts = True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3