For Loop in a Data Frame to find near duplicate rows

Time:05-02

I have a list of values like this:

l = [0,1,1,1,0,0,1,0,1,0]

and I'm trying to find near-duplicate rows (rows that differ from it in only one or two digits) in a dataframe like the one below.

Please keep in mind that there are many more rows and columns; this is just a sample dataframe:

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])

   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
y  1  1  1  0  1  1  1  1  1  1
z  0  0  0  1  0  1  0  0  0  1

Any suggestions would be appreciated.

CodePudding user response:

You can use df.eq(l).sum(axis=1) to count, for each row, the positions where the value matches your list (the list is aligned with the columns, so it must have one element per column):

l = [0,1,1,1,0,0,1,0,1,0]
df.eq(l).sum(axis=1)

x    6
y    4
z    4
dtype: int64
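
Since the question is phrased in terms of differences, it may be more direct to count mismatches rather than matches. The following is a minimal sketch using the sample data from the question; the name n_diff is only illustrative and not part of the original answer.

```python
import pandas as pd

l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])

# df.ne(l) is True wherever a row value differs from the list value in the
# same column position; summing across columns gives the number of
# differing digits per row (the Hamming distance to l).
n_diff = df.ne(l).sum(axis=1)
print(n_diff)
```

For this sample, the mismatch counts are simply len(l) minus the match counts shown above.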

To keep only the rows that differ from the list in at most diff positions, filter on that count:

diff = 4
df[df.eq(l).sum(axis=1).ge(len(l)-diff)]

output:

   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
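
If this needs to be reused on a larger dataframe, the same idea could be wrapped in a small helper. This is only a sketch: the function name near_duplicates and the parameter max_diff are hypothetical, and it assumes the reference list has exactly one element per column.

```python
import pandas as pd

def near_duplicates(df, ref, max_diff=2):
    """Return the rows of df that differ from ref in at most max_diff positions."""
    if len(ref) != df.shape[1]:
        raise ValueError("ref must have one element per column of df")
    # Count differing positions per row, then keep rows within the threshold.
    n_diff = df.ne(ref).sum(axis=1)
    return df[n_diff.le(max_diff)]

l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])

loose = near_duplicates(df, l, max_diff=4)   # same threshold as diff = 4
strict = near_duplicates(df, l, max_diff=2)  # "one or two digit difference"
print(loose)
print(strict)
```

Note that with the sample data no row is within two digits of the list, so the stricter threshold returns an empty dataframe; only the looser threshold of 4 keeps row x.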