Is there a way to remove non unique rows in data frame without using apply?

Time:02-20

I have a large DataFrame with over a million rows, and I would like to drop every row that contains a duplicated value, keeping only rows whose values are all unique within the row.

    0   1  2   4  3
0  13   3  2   0  3  # want to drop
1  13  72  2  13  1  # want to drop
2  13   3  2   8  5

Is there a faster way of achieving the same result as the code below?

df[df.apply(lambda x: x.is_unique, axis=1)]
#     0  1  2  4  3
# 2  13  3  2  8  5

CodePudding user response:

NumPy usually runs faster than pandas for row-wise work like this, because df.apply with axis=1 builds a Series for every row.

So try the following code:

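import numpy as np
nCol = df.shape[1]
# Keep a row only if its number of distinct values equals the number of columns.
df[np.apply_along_axis(lambda row: np.unique(row).size == nCol, 1, df.values)]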

My comparison of execution time, using %timeit, indicates that my code is about 3 times faster than yours.
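
A comparison like this can be reproduced outside a notebook with the standard timeit module. The following is only a minimal sketch: the random test data, its size, and the function names with_pandas_apply and with_numpy_apply are assumptions made for illustration, not the data behind the figure quoted above, so the ratio you measure may differ.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 10,000 rows of 5 small integers,
# so duplicates within a row are common.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(10_000, 5)))
n_col = df.shape[1]


def with_pandas_apply():
    # The approach from the question.
    return df[df.apply(lambda x: x.is_unique, axis=1)]


def with_numpy_apply():
    # The approach from this answer.
    mask = np.apply_along_axis(
        lambda row: np.unique(row).size == n_col, 1, df.to_numpy()
    )
    return df[mask]


# Both approaches must select the same rows before timing means anything.
assert with_pandas_apply().equals(with_numpy_apply())

t_pandas = timeit.timeit(with_pandas_apply, number=3)
t_numpy = timeit.timeit(with_numpy_apply, number=3)
print(f"pandas apply: {t_pandas:.3f} s, numpy apply_along_axis: {t_numpy:.3f} s")
```

Raise the row count toward your real data size to see how the gap changes.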

For a bigger source DataFrame the difference may be greater. Check it on your own data and then post the result in a comment.

By the way, I also checked the solution proposed by enke, but it seems to be slower than your code.
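
Note that np.apply_along_axis still calls the Python function once per row. A fully vectorized alternative, which is not part of the answer above and has not been timed here, is to sort each row and check that no value equals its neighbour. This is a sketch under stated assumptions: the columns are numeric with a common dtype, and there are no missing values (NaN compares unequal to itself, so repeated NaNs in a row would count as distinct).

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame(
    [[13, 3, 2, 0, 3], [13, 72, 2, 13, 1], [13, 3, 2, 8, 5]],
    columns=[0, 1, 2, 4, 3],
)

# Sort each row so that equal values end up next to each other.
sorted_rows = np.sort(df.to_numpy(), axis=1)

# A row is all-unique when no value equals its right-hand neighbour.
mask = (sorted_rows[:, 1:] != sorted_rows[:, :-1]).all(axis=1)

result = df[mask]
print(result)
```

On the example frame this keeps only the row with index 2, the same row the original code keeps.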
