Check values over rows of pandas dataframe

I have the following dataset with different anomaly detection scores (where 1 is an outlier and 0 is an inlier):

   a  b  c  d  e
0  0  0  0  0  1
1  0  1  0  0  0
2  1  0  0  0  1
3  0  0  0  0  0
4  0  0  0  0  1

What I want to do is add another column that is 1 if the row contains at least one 1, and 0 otherwise:

   a  b  c  d  e  result
0  0  0  0  0  1       1
1  0  1  0  0  0       1
2  1  0  0  0  1       1
3  0  0  0  0  0       0
4  0  0  0  0  1       1

I'm sure I'm missing something simple, but what is the most efficient way to do this?

CodePudding user response：

Since the columns only contain 0s and 1s, you can use max. It might be slightly slower than any, but no type conversion is required, so overall, depending on the frequencies of 0 and 1, it can still be quite fast:

df['result'] = df.max(axis=1)
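
A minimal sketch of this on the data from the question. It assumes every column holds only 0 or 1; with any other values, max would return the largest value in the row rather than a 0/1 flag.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "a": [0, 0, 1, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0],
    "e": [1, 0, 1, 0, 1],
})

# Row-wise maximum: 1 if at least one column in the row is 1, otherwise 0
df["result"] = df.max(axis=1)
print(df)
```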

CodePudding user response：

Use DataFrame.any:

df['result'] = df.eq(1).any(axis=1).astype(int)
print(df)
   a  b  c  d  e  result
0  0  0  0  0  1       1
1  0  1  0  0  0       1
2  1  0  0  0  1       1
3  0  0  0  0  0       0
4  0  0  0  0  1       1
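
As a quick sanity check, the any-based result can be compared with the max-based one on the data from the question. This is only a sketch: the `cols` list is an assumption added here so that the check ignores any `result` column already present in the frame.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "a": [0, 0, 1, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0],
    "e": [1, 0, 1, 0, 1],
})

# Restrict to the score columns so an existing 'result' column is not included
cols = ["a", "b", "c", "d", "e"]

via_any = df[cols].eq(1).any(axis=1).astype(int)  # True/False cast to 1/0
via_max = df[cols].max(axis=1)                    # valid only for 0/1 data

# Compare element-wise so a platform-dependent integer dtype does not matter
same = bool((via_any == via_max).all())
print(via_any.tolist())
print(same)
```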


# 50k rows for testing
df = pd.concat([df] * 10000, ignore_index=True)
    
In [109]: %timeit df.any(axis=1).astype(int)
2.48 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [110]: %timeit df.eq(1).any(axis=1).astype(int)
1.46 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [111]: %timeit np.any(df.eq(1), axis=1).astype(int)
1.47 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [112]: %timeit np.where(np.any(df.eq(1), axis=1), 1, 0)
1.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

No idea why, but these are the timings I got for max:

In [115]: %timeit df.max(axis=1)
2.08 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %timeit np.max(df, axis=1)
2.17 ms ± 93.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
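
The timings above come from an IPython session. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. The names `base`, `big`, `candidates` and `timings` are made up for this sketch, and the numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

# Data copied from the question, repeated to 50k rows as in the test above
base = pd.DataFrame({
    "a": [0, 0, 1, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0],
    "e": [1, 0, 1, 0, 1],
})
big = pd.concat([base] * 10000, ignore_index=True)

# Approaches to compare; each callable returns the row-wise flag
candidates = {
    "any": lambda: big.any(axis=1).astype(int),
    "eq(1).any": lambda: big.eq(1).any(axis=1).astype(int),
    "max": lambda: big.max(axis=1),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 20 calls each, converted to milliseconds per call
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    timings[name] = best * 1e3
    print(f"{name:>10}: {timings[name]:.2f} ms per call")
```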