Home > Software design >  How do I select rows in a dataframe based on one value having at least one True value in a different
How do I select rows in a dataframe based on one value having at least one True value in a different

Time:10-18

I currently have a dataframe whose abridged version looks like this:

NP                      significant       NP1       NP2       NP3       NP4 
identifier                                                                    
A1BG~P04217                   False  0.389413  0.407739 -0.393914  0.893496   
A2M~P01023                    False       NaN -0.934943       NaN  0.871183   
A2M~P01023                     True  1.549677       NaN  1.836256       NaN   
AARS1~P49588                  False  0.589909 -0.416170 -1.532500 -0.378543   
ABCB6~Q9NP58                  False -0.999101 -0.506800 -0.530378 -0.794270   
ABCC1~I3L4X2                  False  0.425545  0.645758 -0.834004 -1.027046   
ABCC4~O15439                  False -0.749277 -0.245724 -0.457495 -0.192631   
ABCE1~P61221                  False       NaN -0.689983 -2.438507 -1.237040   
ABCE1~P61221                   True -1.959488       NaN       NaN       NaN  

What I want to do is I want to select rows where each unique identifier has at least one True value in the significant column. So ultimately, I would like my new dataframe to look like this:

NP                  significant       NP1       NP2       NP3       NP4  
identifier                                                                
A2M~P01023                False       NaN -0.934943       NaN  0.871183   
A2M~P01023                 True  1.549677       NaN  1.836256       NaN   
ABCE1~P61221              False       NaN -0.689983 -2.438507 -1.237040   
ABCE1~P61221               True -1.959488       NaN       NaN       NaN 

Does anyone know how this can be done? I am sorry in advance if this question is a duplicate, but I have tried a few different approaches (e.g. using .any() and .groupby() ) and nothing has worked for me so far.

CodePudding user response:

Try with transform

df = df[df.groupby(level=0)['significant'].transform('any')]
Out[17]: 
              significant       NP1       NP2       NP3       NP4
A2M~P01023          False       NaN -0.934943       NaN  0.871183
A2M~P01023           True  1.549677       NaN  1.836256       NaN
ABCE1~P61221        False       NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221         True -1.959488       NaN       NaN       NaN

CodePudding user response:

Let's try groupby and filter

out = df.groupby(df.index).filter(lambda g: g['significant'].any())
print(out)

              significant       NP1       NP2       NP3       NP4
A2M~P01023          False       NaN -0.934943       NaN  0.871183
A2M~P01023           True  1.549677       NaN  1.836256       NaN
ABCE1~P61221        False       NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221         True -1.959488       NaN       NaN       NaN

CodePudding user response:

I illustrate one possible solution. Assume the code is pseudocode (i.e., don't take it too literally; just as a guide to write actual code).

First, find indexes of rows with true values in the "significant" column.

idx = data.significant == True

Next, subset the dataframe with those indexes and extract the corresponding unique identifiers.

IDs = data.iloc[idx, :].identifier

Finally, subset the original dataframe with those unique identifiers.

data.iloc[data.identifier.isin(IDs),:]
  • Related