I am looking for a way to select pandas rows where either:
all items in a list is present in either one of the columns in that row,
or where a minimum number of list items is present in that row
If I have something simple as this:
test_df=pd.DataFrame([['test1', 'test2', 'test3'],
['test4', 'test5', 'test6'], ['test2', 'test3', 'test2'], ['`test1', 'test2', 'test5']], columns=['col1', 'col2', 'col3'])
col1 col2 col3
0 test1 test2 test3
1 test4 test5 test6
2 test2 test3 test1
3 test1 test2 test5
test_list=['test1', 'test2', 'test3']
First query: all items in the list is present in either columns like this:
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test1
And a query where minimum two items from the list is present in either column like this:
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test1
3 test1 test2 test5
Is there a way to do this?
I can select rows where minimum one is present by this (ugly i know, but works) but can't figure out how to expand this to min. two
test_df[(test_df['col1'].isin(test_list) |(test_df['col2'].isin(test_list)|(test_df['col3'].isin(test_list))]
CodePudding user response:
Use DataFrame.isin
for test membership with DataFrame.all
for test match all values per rows in boolean indexing
:
df = test_df[test_df.isin(test_list).all(axis=1)]
print (df)
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test2
Or you can count matched values with sum
and test if greater or equal by Series.ge
:
N = 2
df = test_df[test_df.isin(test_list).sum(axis=1).ge(N)]
print (df)
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test2
3 test1 test2 test5
If need test columns by list:
cols = ['col1','col2','col3']
df = test_df[test_df[cols].isin(test_list).all(axis=1)]
print (df)
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test2
N = 2
cols = ['col1','col2','col3']
df = test_df[test_df[cols].isin(test_list).sum(axis=1).ge(N)]
print (df)
col1 col2 col3
0 test1 test2 test3
2 test2 test3 test2
3 test1 test2 test5