I currently have a dataframe whose abridged version looks like this:
NP significant NP1 NP2 NP3 NP4
identifier
A1BG~P04217 False 0.389413 0.407739 -0.393914 0.893496
A2M~P01023 False NaN -0.934943 NaN 0.871183
A2M~P01023 True 1.549677 NaN 1.836256 NaN
AARS1~P49588 False 0.589909 -0.416170 -1.532500 -0.378543
ABCB6~Q9NP58 False -0.999101 -0.506800 -0.530378 -0.794270
ABCC1~I3L4X2 False 0.425545 0.645758 -0.834004 -1.027046
ABCC4~O15439 False -0.749277 -0.245724 -0.457495 -0.192631
ABCE1~P61221 False NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221 True -1.959488 NaN NaN NaN
What I want to do is I want to select rows where each unique identifier has at least one True value in the significant
column. So ultimately, I would like my new dataframe to look like this:
NP significant NP1 NP2 NP3 NP4
identifier
A2M~P01023 False NaN -0.934943 NaN 0.871183
A2M~P01023 True 1.549677 NaN 1.836256 NaN
ABCE1~P61221 False NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221 True -1.959488 NaN NaN NaN
Does anyone know how this can be done? I am sorry in advance if this question is a duplicate, but I have tried a few different approaches (e.g. using .any() and .groupby() ) and nothing has worked for me so far.
CodePudding user response:
Try with transform
df = df[df.groupby(level=0)['significant'].transform('any')]
Out[17]:
significant NP1 NP2 NP3 NP4
A2M~P01023 False NaN -0.934943 NaN 0.871183
A2M~P01023 True 1.549677 NaN 1.836256 NaN
ABCE1~P61221 False NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221 True -1.959488 NaN NaN NaN
CodePudding user response:
Let's try groupby
and filter
out = df.groupby(df.index).filter(lambda g: g['significant'].any())
print(out)
significant NP1 NP2 NP3 NP4
A2M~P01023 False NaN -0.934943 NaN 0.871183
A2M~P01023 True 1.549677 NaN 1.836256 NaN
ABCE1~P61221 False NaN -0.689983 -2.438507 -1.237040
ABCE1~P61221 True -1.959488 NaN NaN NaN
CodePudding user response:
I illustrate one possible solution. Assume the code is pseudocode (i.e., don't take it too literally; just as a guide to write actual code).
First, find indexes of rows with true values in the "significant" column.
idx = data.significant == True
Next, subset the dataframe with those indexes and extract the corresponding unique identifiers.
IDs = data.iloc[idx, :].identifier
Finally, subset the original dataframe with those unique identifiers.
data.iloc[data.identifier.isin(IDs),:]