I have a list of values like this:
l = [0,1,1,1,0,0,1,0,1,0]
and I'm trying to find near-duplicate rows (with a one- or two-digit difference) in a dataframe like the one below.
Please keep in mind that there are many more rows and columns; this is just a sample dataframe:
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])
   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
y  1  1  1  0  1  1  1  1  1  1
z  0  0  0  1  0  1  0  0  0  1
Any suggestions would be appreciated.
CodePudding user response:
You can use df.eq(l).sum(axis=1) to compute, for each row, the number of elements that match your list (compared position by position):
l = [0,1,1,1,0,0,1,0,1,0]
df.eq(l).sum(axis=1)
x 6
y 4
z 4
dtype: int64
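Since the question is phrased in terms of differences rather than matches, the same idea can be turned around to count mismatches per row. The following is only a sketch on the sample data above; the name n_diff is made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])
l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

# ne() is the element-wise "not equal" counterpart of eq();
# summing along axis=1 gives the number of differing positions per row.
# n_diff is a hypothetical name, not part of the original answer.
n_diff = df.ne(l).sum(axis=1)
print(n_diff)  # x differs in 4 positions, y and z in 6 each
```

Each value is simply len(l) minus the corresponding match count shown above.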
To filter with a threshold, i.e. keep only the rows that differ from the list in at most diff positions, use:
diff = 4
df[df.eq(l).sum(axis=1).ge(len(l)-diff)]
Output:
   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
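If you need this in several places, the filter could be wrapped in a small helper. This is a rough sketch rather than a definitive implementation: the function name near_duplicates and its parameters ref and max_diff are invented here, and it assumes the list has exactly one value per column, in column order.

```python
import pandas as pd

def near_duplicates(df, ref, max_diff):
    """Return the rows of df that differ from ref in at most max_diff positions.

    Hypothetical helper: assumes ref has one value per column of df,
    in the same order as the columns.
    """
    if len(ref) != df.shape[1]:
        raise ValueError("ref must have one value per column of df")
    # Count differing positions per row, then keep rows within the threshold
    n_diff = df.ne(ref).sum(axis=1)
    return df[n_diff.le(max_diff)]

# Sample data from the question
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])
l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

print(near_duplicates(df, l, max_diff=4))  # only row x
print(near_duplicates(df, l, max_diff=2))  # empty on this sample
```

Note that with max_diff=2 (the "one or two digit difference" from the question) the sample dataframe returns no rows, because the closest row, x, differs from the list in 4 positions.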