I've been trying to figure out how to compare two columns that share some values between them, but at different rows.
For example
col_index | col_1 | col_2 |
---|---|---|
1 | 12 | 34 |
2 | 16 | 42 |
3 | 58 | 35 |
4 | 99 | 60 |
5 | 2 | 12 |
12 | 35 | 99 |
In the above example, col_1 and col_2 match on several occasions: e.g. values '12' and '99'.
I need to be able to find which rows these match at so that I can get the result of col_index.
What would be the best way to do that?
CodePudding user response:
Simply loop over the values that are present in both columns, using the Series.isin
method
# test data:
a = 12,16,58,99
b = 34,99,35,12
c = 1,2,3,5
d = pd.DataFrame({"col_1":a, "col_2":b, 'col_idx':c})
# col_1 col_2 col_idx
#0 12 34 1
#1 16 99 2
#2 58 35 3
#3 99 12 5
for _,row in d.loc[d.col_1.isin(d.col_2)].iterrows():
val = row.col_1
idx1 = row.col_idx
print(val, idx1, d.query("col_2==%d" % val).col_idx.values)
#12 1 [5]
#99 5 [2]
If your values are strings (instead of integers as in this example), change the query
argument accordingly: query("col_2=='%s'" % val)
.
CodePudding user response:
IIUC only row 2 should be removed from col_index.
You can use np.intersect1d
to find the common values between the two columns and then check if these values are in your columns using isin
:
import numpy as np
common_values = np.intersect1d(df.col_1,df.col_2)
res = df[(df.col_1.isin(common_values))|(df.col_2.isin(common_values))]
res
col_index col_1 col_2
0 1 12 34 # 12
2 3 58 35 # 35
3 4 99 60 # 99
4 5 2 12 # 12
5 12 35 99 # 99
res[['col_index']]
col_index
0 1
2 3
3 4
4 5
5 12