Filter DataFrame for most matches-CodePudding

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:

Index	One	Two	Three	Four
1	a	b	d	c
2	b	b	d	d
3	a	b	d
4	c	b	c	d
5	a	b	c	g
6	a	b	c
7	a	s	c	f
8	a	f	c
9	a	b
10	a	b	t	d
11	a	b	g
...	...	...	...	...
100	a	b	c	d

My goal would be to filter for the rows with most matches with the list in the corrisponding position (e.g. position 1 in the list has to match column 1, position 2 column 2 etc...). In this specific case, excluding row 100, row 5 and 6 would be the one selected since they match 'a', 'b' and 'c' but if row 100 were to be included row 100 and all the other rows matching all elements would be the selected.

Thanks for your help!

CodePudding user response：

I would use:

list_to_match = ['a','b','c','d']

# compute a mask of identical values
mask = df.eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)

# get the max that is not all equal
out = df[mask2.eq(mask2[mask2.lt(len(list_to_match))].max())]

print(out)

Output:

      One Two Three  Four
Index                    
5       a   b     c     g
6       a   b     c  None

CodePudding user response：

You can iterate over the columns, dropping rows that don't match the corresponding element in the list to match. With a little extra bookkeeping, we stop filtering when an additional filter operation would produce an empty DataFrame:

for colname, item_to_match in zip(df.columns, list_to_match):
    filter_result = df[df[colname] == item_to_match]
    if len(filter_result.index):
        df = filter_result

This outputs:

      One Two Three Four  matches
Index
5       a   b     c    g        3
6       a   b     c  NaN        3

CodePudding user response：

Use

df = df.set_index('Index') #only if it is not already the index
print(df)
#      One Two Three Four
#Index                   
#1       a   b     d    c
#2       b   b     d    d
#3       a   b     d    c
#4       c   b     c    d
#5       a   b     c    g
#6       a   b     c    
#7       a   s     c    f
#8       a   f     c    g
#9       a   b     c    f
#10      a   b     t    d
#11      a   b     g    f
#100     a   b     c    d
s = df.eq(list('abcd')).sum(axis=1)
df_filtered = df.loc[s.eq(s.max())]
print(df_filtered)
#      One Two Three Four
#Index                   
#100     a   b     c    d

Index	One	Two	Three	Four
1	a	b	d	c
2	b	b	d	d
3	a	b	d
4	c	b	c	d
5	a	b	c	g
6	a	b	c
7	a	s	c	f
8	a	f	c
9	a	b
10	a	b	t	d
11	a	b	g
...	...	...	...	...
100	a	b	c	d

Index	One	Two	Three	Four
1	a	b	d	c
2	b	b	d	d
3	a	b	d
4	c	b	c	d
5	a	b	c	g
6	a	b	c
7	a	s	c	f
8	a	f	c
9	a	b
10	a	b	t	d
11	a	b	g
...	...	...	...	...
100	a	b	c	d

Index	One	Two	Three	Four
1	a	b	d	c
2	b	b	d	d
3	a	b	d
4	c	b	c	d
5	a	b	c	g
6	a	b	c
7	a	s	c	f
8	a	f	c
9	a	b
10	a	b	t	d
11	a	b	g
...	...	...	...	...
100	a	b	c	d