Home > Mobile >  Selecting pandas dataframe rows with same pair of two column values and different on third for certa
Selecting pandas dataframe rows with same pair of two column values and different on third for certa

Time:06-02

I've a pandas dataframe of two variables( Begin and End) for three replicates(R1, R2, R3) each of Control(C) and Treatment(T)

Begin   End Expt
2   5   C_R1
2   5   C_R2
2   5   C_R3
2   5   T_R1
2   5   T_R2
2   5   T_R3
4   7   C_R2
4   7   C_R3
4   7   T_R1
4   7   T_R2
4   7   T_R3

I want to pick up those rows only for which all three replicates of both control and treatment totally six were observed, i.e (Begin,End:2,5) and not (Begin,End:4,7) as it has only five observations missing the C_R1.

I've gone through some posts here and tried the following, which works for a small set of sample but I've to test with real data which has around 50K rows

my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]

Please let me know if this is OK or if any better technique exists. Thanks

CodePudding user response:

df1
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
index Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3

CodePudding user response:

df[df.groupby(['Begin', 'End'])['Expt']
     .transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts = True)[1] == 3).all())]
 
   Begin  End  Expt
0      2    5  C_R1
1      2    5  C_R2
2      2    5  C_R3
3      2    5  T_R1
4      2    5  T_R2
5      2    5  T_R3
  • Related