I'm using Python 3.7.7 with Pandas 1.1.3.
I have two Pandas databases df1
and df2
with identical column names which can be listed by list(df1.columns)
or list(df2.columns)
. Their length is len(list(df1.columns)) = 200
.
I would like to look for all rows in df2
where the frist 10 values match the values in the i
'th row of df1
.
I am aware this can be done the following way:
colnames = list(df1.columns[:10]) # lists the first 10 column names
toFind = list(df1.iloc[i][:10]) # lists the first 10 values in the i'th row of df1
df2.loc[(df2[colnames[0]] == toFind[0]) & (df2[colnames[1]] == toFind[1]) & (df2[colnames[2]] == toFind[2]) & (df2[colnames[3]] == toFind[3]) & (df2[colnames[4]] == toFind[4]) & (df2[colnames[5]] == toFind[5]) & (df2[colnames[6]] == toFind[6]) & (df2[colnames[7]] == toFind[7]) & (df2[colnames[8]] == toFind[8]) & (df2[colnames[9]] == toFind[9])]
However this is extremely long and difficult to change. If I want to filter by a different number of matching indices (other than 10), I'd have to rewrite the whole expression every time.
Is there a way to do this automatically? I'm looking for something like
colnames = list(df1.columns[:10])
toFind = list(df1.iloc[i][:10])
df2.loc[(df2[colnames[t]] == toFind[t]) for t in range(len(colnames))]
However this obviously returns an error, as nothing like this is implemented in Pandas.
Example: (for simplicity's sake, I'll use 3 instead of 10, and set i=0
)
df1:
i col1 col2 col3 col4 col5 col6 col7
0 1 7 3 4 8 2 4
1 2 5 7 1 4 8 2
2 6 6 8 8 9 1 3
3 7 8 5 2 3 0 9
4 4 0 7 4 5 6 3
5 2 7 6 8 1 7 5
df2:
i col1 col2 col3 col4 col5 col6 col7
0 1 7 3 3 3 4 5
1 2 5 7 5 5 2 5
2 1 7 3 6 6 4 2
3 3 5 7 7 7 6 4
4 1 7 3 8 8 2 3
5 2 5 7 3 9 0 6
Then I would like to select all rows of df2
where the first 3 values are are 1 7 3
(same as the first 3 values of the 0
th row of df1
). That would be
filtered df2 (0):
i col1 col2 col3 col4 col5 col6 col7
0 1 7 3 3 3 4 5
2 1 7 3 6 6 4 2
4 1 7 3 8 8 2 3
As you can see, it filtered for the rows of df2
, where the first 3 values are 1 7 3
If instead i=1
, then I'd like to filter for the rows of df2
where the first 3 values are 2 5 7
(same as the first 3 values of the 1
st row of df1
)
In that case, the output would be
filtered df2 (1):
i col1 col2 col3 col4 col5 col6 col7
1 2 5 7 5 5 2 5
5 2 5 7 3 9 0 6
The aforementioned 3, that controls which columns I filter by is also a parameter, in the previous example, it controled col1 col2 col3
, and in my first explanation it was 10, it controled col1 col2 ... col10
In general, it could be any list of the columns, for example [col5 col7 col8 col15]
.
CodePudding user response:
You can use np.logical_and.reduce
:
i = 0
colnames = list(df1.columns[:3])
toFind = list(df1.iloc[i][:3])
mask = np.logical_and.reduce([(df2[colnames[t]]==toFind[t]) for t in range(len(colnames))])
df = df2[mask]
print (df)
col1 col2 col3 col4 col5 col6 col7
i
0 1 7 3 3 3 4 5
2 1 7 3 6 6 4 2
4 1 7 3 8 8 2 3
Another idea is use:
i = 0
colnames = list(df1.columns[:3])
toFind = list(df1.iloc[i][:3])
df = df2[df2[colnames].eq(toFind).all(axis=1)]
print (df)
col1 col2 col3 col4 col5 col6 col7
i
0 1 7 3 3 3 4 5
2 1 7 3 6 6 4 2
4 1 7 3 8 8 2 3