I have a list (list_to_match = ['a','b','c','d']
) and a dataframe like this one below:
Index | One | Two | Three | Four |
---|---|---|---|---|
1 | a | b | d | c |
2 | b | b | d | d |
3 | a | b | d | |
4 | c | b | c | d |
5 | a | b | c | g |
6 | a | b | c | |
7 | a | s | c | f |
8 | a | f | c | |
9 | a | b | ||
10 | a | b | t | d |
11 | a | b | g | |
... | ... | ... | ... | ... |
100 | a | b | c | d |
My goal would be to filter for the rows with most matches with the list in the corrisponding position (e.g. position 1 in the list has to match column 1, position 2 column 2 etc...). In this specific case, excluding row 100, row 5 and 6 would be the one selected since they match 'a', 'b' and 'c' but if row 100 were to be included row 100 and all the other rows matching all elements would be the selected.
Thanks for your help!
CodePudding user response:
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the max that is not all equal
out = df[mask2.eq(mask2[mask2.lt(len(list_to_match))].max())]
print(out)
Output:
One Two Three Four
Index
5 a b c g
6 a b c None
CodePudding user response:
You can iterate over the columns, dropping rows that don't match the corresponding element in the list to match. With a little extra bookkeeping, we stop filtering when an additional filter operation would produce an empty DataFrame:
for colname, item_to_match in zip(df.columns, list_to_match):
filter_result = df[df[colname] == item_to_match]
if len(filter_result.index):
df = filter_result
This outputs:
One Two Three Four matches
Index
5 a b c g 3
6 a b c NaN 3
CodePudding user response:
Use
df = df.set_index('Index') #only if it is not already the index
print(df)
# One Two Three Four
#Index
#1 a b d c
#2 b b d d
#3 a b d c
#4 c b c d
#5 a b c g
#6 a b c
#7 a s c f
#8 a f c g
#9 a b c f
#10 a b t d
#11 a b g f
#100 a b c d
s = df.eq(list('abcd')).sum(axis=1)
df_filtered = df.loc[s.eq(s.max())]
print(df_filtered)
# One Two Three Four
#Index
#100 a b c d