I have the following dataset with different anomaly detection scores (where 1 is an outlier and 0 is an inlier):
   a  b  c  d  e
0  0  0  0  0  1
1  0  1  0  0  0
2  1  0  0  0  1
3  0  0  0  0  0
4  0  0  0  0  1
What I want to do is add another column that checks whether the row contains a 1 in any of the score columns and, if so, is set to 1 itself (otherwise 0):
   a  b  c  d  e  result
0  0  0  0  0  1       1
1  0  1  0  0  0       1
2  1  0  0  0  1       1
3  0  0  0  0  0       0
4  0  0  0  0  1       1
I'm sure I'm missing something simple, but what is the most efficient way to do this?
CodePudding user response:
Since you only have 0s and 1s, you can use max. It might be slightly slower than any, but no type conversion is required, so overall, depending on the frequencies of 0 and 1, it can be quite fast:
df['result'] = df.max(axis=1)
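For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the frame from the question and applies the row-wise max. The `score_cols` list is my own addition (not in the question); it just keeps `result` out of the calculation if the line is run twice. Note that this only works as intended when the columns really contain nothing but 0 and 1.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "a": [0, 0, 1, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0],
    "e": [1, 0, 1, 0, 1],
})

# Hypothetical helper list: restrict the max to the score columns only,
# so re-running the assignment does not include "result" itself.
score_cols = ["a", "b", "c", "d", "e"]

# With strictly 0/1 data, the row-wise maximum is 1 exactly when
# at least one column in that row is 1.
df["result"] = df[score_cols].max(axis=1)
print(df)
```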
CodePudding user response:
Use DataFrame.any:
df['result'] = df.eq(1).any(axis=1).astype(int)
print(df)
   a  b  c  d  e  result
0  0  0  0  0  1       1
1  0  1  0  0  0       1
2  1  0  0  0  1       1
3  0  0  0  0  0       0
4  0  0  0  0  1       1
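As a side note on why `eq(1)` can matter: plain `any` treats every non-zero value as True, while `eq(1).any` only reacts to values that are exactly 1. On pure 0/1 data both give the same answer. The small frame below is made up for illustration (the 2 is not from the question) to show where they would differ:

```python
import pandas as pd

# Hypothetical data: the 2 in the last row is an assumption for illustration,
# the question's data only contains 0 and 1.
df = pd.DataFrame({
    "a": [0, 0, 2],
    "b": [0, 1, 0],
})

# any() treats every non-zero value as True, so the 2 counts as a hit.
truthy = df.any(axis=1).astype(int)

# eq(1) first narrows the check to values that are exactly 1.
exactly_one = df.eq(1).any(axis=1).astype(int)

print(truthy.tolist())
print(exactly_one.tolist())
```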
# 50k rows for the test
df = pd.concat([df] * 10000, ignore_index=True)
In [109]: %timeit df.any(axis=1).astype(int)
2.48 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [110]: %timeit df.eq(1).any(axis=1).astype(int)
1.46 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [111]: %timeit np.any(df.eq(1), axis=1).astype(int)
1.47 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [112]: %timeit np.where(np.any(df.eq(1), axis=1), 1, 0)
1.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
No idea why, but this is what I got for max:
In [115]: %timeit df.max(axis=1)
2.08 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [116]: %timeit np.max(df, axis=1)
2.17 ms ± 93.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
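If you want to repeat the comparison outside IPython, a rough sketch with the standard-library `timeit` module could look like this. The `candidates` dict and the number of repetitions are my own choices, and the timings will vary with machine, pandas version, and the share of 1s in the data, so treat the numbers as indicative only.

```python
import timeit

import pandas as pd

# Rebuild the example frame from the question.
base = pd.DataFrame({
    "a": [0, 0, 1, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0],
    "e": [1, 0, 1, 0, 1],
})

# 50k rows for the test, as in the timings above.
big = pd.concat([base] * 10000, ignore_index=True)

# Hypothetical mapping of label -> zero-argument callable to time.
candidates = {
    "any": lambda: big.any(axis=1).astype(int),
    "eq(1).any": lambda: big.eq(1).any(axis=1).astype(int),
    "max": lambda: big.max(axis=1),
}

reference = candidates["eq(1).any"]()
repeats = 20  # arbitrary; raise it for more stable numbers

for name, func in candidates.items():
    # Make sure every variant produces the same flags before timing it.
    assert (func() == reference).all()
    seconds = timeit.timeit(func, number=repeats) / repeats
    print(f"{name:>10}: {seconds * 1e3:.2f} ms per call")
```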