pandas selection from list-CodePudding

I am looking for a way to select pandas rows where either:

all items in a list is present in either one of the columns in that row,

or where a minimum number of list items is present in that row

If I have something simple as this:

test_df=pd.DataFrame([['test1', 'test2', 'test3'], 
['test4', 'test5', 'test6'], ['test2', 'test3', 'test2'], ['`test1', 'test2', 'test5']], columns=['col1', 'col2', 'col3'])

col1    col2    col3
0   test1   test2   test3
1   test4   test5   test6
2   test2   test3   test1
3   test1   test2   test5

test_list=['test1', 'test2', 'test3']

First query: all items in the list is present in either columns like this:

    col1    col2    col3
0   test1   test2   test3
2   test2   test3   test1

And a query where minimum two items from the list is present in either column like this:

    col1    col2    col3
0   test1   test2   test3
2   test2   test3   test1
3   test1   test2   test5

Is there a way to do this?

I can select rows where minimum one is present by this (ugly i know, but works) but can't figure out how to expand this to min. two

test_df[(test_df['col1'].isin(test_list) |(test_df['col2'].isin(test_list)|(test_df['col3'].isin(test_list))]

CodePudding user response：

Use DataFrame.isin for test membership with DataFrame.all for test match all values per rows in boolean indexing:

df = test_df[test_df.isin(test_list).all(axis=1)]
print (df)
    col1   col2   col3
0  test1  test2  test3
2  test2  test3  test2

Or you can count matched values with sum and test if greater or equal by Series.ge:

N = 2
df = test_df[test_df.isin(test_list).sum(axis=1).ge(N)]
print (df)
    col1   col2   col3
0  test1  test2  test3
2  test2  test3  test2
3  test1  test2  test5

If need test columns by list:

cols = ['col1','col2','col3']
df = test_df[test_df[cols].isin(test_list).all(axis=1)]
print (df)
    col1   col2   col3
0  test1  test2  test3
2  test2  test3  test2

N = 2
cols = ['col1','col2','col3']
df = test_df[test_df[cols].isin(test_list).sum(axis=1).ge(N)]
print (df)
    col1   col2   col3
0  test1  test2  test3
2  test2  test3  test2
3  test1  test2  test5