(Python/Pandas) How to automatically filter rows by multiple conditions (perhaps using list comprehe-CodePudding

I'm using Python 3.7.7 with Pandas 1.1.3.

I have two Pandas databases df1 and df2 with identical column names which can be listed by list(df1.columns) or list(df2.columns). Their length is len(list(df1.columns)) = 200.

I would like to look for all rows in df2 where the frist 10 values match the values in the i'th row of df1.

I am aware this can be done the following way:

colnames = list(df1.columns[:10]) # lists the first 10 column names
toFind = list(df1.iloc[i][:10]) # lists the first 10 values in the i'th row of df1
df2.loc[(df2[colnames[0]] == toFind[0]) & (df2[colnames[1]] == toFind[1]) & (df2[colnames[2]] == toFind[2]) & (df2[colnames[3]] == toFind[3]) & (df2[colnames[4]] == toFind[4]) & (df2[colnames[5]] == toFind[5]) & (df2[colnames[6]] == toFind[6]) & (df2[colnames[7]] == toFind[7]) & (df2[colnames[8]] == toFind[8]) & (df2[colnames[9]] == toFind[9])]

However this is extremely long and difficult to change. If I want to filter by a different number of matching indices (other than 10), I'd have to rewrite the whole expression every time.

Is there a way to do this automatically? I'm looking for something like

colnames = list(df1.columns[:10])
toFind = list(df1.iloc[i][:10])
df2.loc[(df2[colnames[t]] == toFind[t]) for t in range(len(colnames))]

However this obviously returns an error, as nothing like this is implemented in Pandas.

Example: (for simplicity's sake, I'll use 3 instead of 10, and set i=0)

df1:

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     4     8     2     4
1     2     5     7     1     4     8     2
2     6     6     8     8     9     1     3
3     7     8     5     2     3     0     9
4     4     0     7     4     5     6     3 
5     2     7     6     8     1     7     5

df2:

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     3     3     4     5
1     2     5     7     5     5     2     5
2     1     7     3     6     6     4     2
3     3     5     7     7     7     6     4
4     1     7     3     8     8     2     3 
5     2     5     7     3     9     0     6

Then I would like to select all rows of df2 where the first 3 values are are 1 7 3 (same as the first 3 values of the 0th row of df1). That would be

filtered df2 (0):

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3

As you can see, it filtered for the rows of df2, where the first 3 values are 1 7 3

If instead i=1, then I'd like to filter for the rows of df2where the first 3 values are 2 5 7 (same as the first 3 values of the 1st row of df1)

In that case, the output would be

filtered df2 (1):

i  col1  col2  col3  col4  col5  col6  col7
1     2     5     7     5     5     2     5
5     2     5     7     3     9     0     6

The aforementioned 3, that controls which columns I filter by is also a parameter, in the previous example, it controled col1 col2 col3, and in my first explanation it was 10, it controled col1 col2 ... col10

In general, it could be any list of the columns, for example [col5 col7 col8 col15].

CodePudding user response：

You can use np.logical_and.reduce:

i = 0
colnames = list(df1.columns[:3])
toFind = list(df1.iloc[i][:3])

mask = np.logical_and.reduce([(df2[colnames[t]]==toFind[t]) for t in range(len(colnames))])
df = df2[mask]

print (df)
   col1  col2  col3  col4  col5  col6  col7
i                                          
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3

Another idea is use:

i = 0
colnames = list(df1.columns[:3])
toFind = list(df1.iloc[i][:3])
df = df2[df2[colnames].eq(toFind).all(axis=1)]

print (df)
   col1  col2  col3  col4  col5  col6  col7
i                                          
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3