I have a large DataFrame with over a million rows, and I would like to drop any row whose values are not all unique within the row itself.
    0   1  2   4  3
0  13   3  2   0  3    # Want to drop
1  13  72  2  13  1    # Want to drop
2  13   3  2   8  5
Is there a faster way of achieving the same result as the code below?
df[df.apply(lambda x: x.is_unique, axis=1)]
#     0   1  2   4  3
# 2  13   3  2   8  5
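For reference, the example frame above can be reproduced with the snippet below (the column labels 0, 1, 2, 4, 3 mirror the header shown):
import pandas as pd

df = pd.DataFrame(
    [[13, 3, 2, 0, 3],
     [13, 72, 2, 13, 1],
     [13, 3, 2, 8, 5]],
    columns=[0, 1, 2, 4, 3],
)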
CodePudding user response:
NumPy generally operates significantly faster than pandas here, because df.apply(..., axis=1) constructs a Series object for every row, while NumPy works on the raw underlying array. So try the following code:
import numpy as np

nCol = df.shape[1]
# Keep rows whose number of unique values equals the number of columns
df[np.apply_along_axis(lambda row: np.unique(row).size == nCol, 1, df.values)]
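As a further sketch along the same NumPy lines (untimed here, so treat it as an idea rather than a measured improvement, and assuming the frame is numeric), the Python-level lambda can be avoided entirely by sorting each row and checking for adjacent duplicates:
s = np.sort(df.values, axis=1)           # duplicates become adjacent after sorting
df[(s[:, 1:] != s[:, :-1]).all(axis=1)]  # keep rows with no adjacent duplicates
Whether this is actually faster should be checked with the same %timeit comparison.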
My comparison of execution times with %timeit indicates that this code runs about 3 times faster than yours.
For a bigger source DataFrame the difference can be even greater. Check on your own, e.g. with the benchmark sketch below, and then share the result in a comment.
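A minimal way to reproduce such a comparison outside IPython is the standard-library timeit module; the random test frame here is an illustrative assumption standing in for the asker's real data:
import numpy as np
import pandas as pd
from timeit import timeit

# Hypothetical test data: a million rows of small integers (an assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1_000_000, 5)))
nCol = df.shape[1]

t_pandas = timeit(lambda: df[df.apply(lambda x: x.is_unique, axis=1)], number=1)
t_numpy = timeit(
    lambda: df[np.apply_along_axis(lambda r: np.unique(r).size == nCol, 1, df.values)],
    number=1,
)
print(f"pandas apply: {t_pandas:.2f} s, numpy: {t_numpy:.2f} s")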
By the way, I also checked the solution proposed by enke, but it seems to be slower than your code.