I have this pandas df
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['steve', 'joey', 'nikolaj', 'buddy', 'chester', 'mateo']})
I would like to compare each row in the name column and generate a df that looks as below:
id name second_name second_id
1 steve joey 2
1 steve nikolaj 3
1 steve buddy 4
1 steve chester 5
1 steve mateo 6
2 joey steve. 1
3. joey. nikolaj. 3
Basically each unique row in the df will repeat itself everytime comparing with each and every other row in that column.
CodePudding user response:
An easy way is a cross merge
with filtering:
out = df.merge(df.add_prefix('second_'), how='cross').query('id != second_id')
Output:
id name second_id second_name
1 1 steve 2 joey
2 1 steve 3 nikolaj
3 1 steve 4 buddy
4 1 steve 5 chester
5 1 steve 6 mateo
6 2 joey 1 steve
8 2 joey 3 nikolaj
9 2 joey 4 buddy
10 2 joey 5 chester
11 2 joey 6 mateo
12 3 nikolaj 1 steve
13 3 nikolaj 2 joey
15 3 nikolaj 4 buddy
16 3 nikolaj 5 chester
17 3 nikolaj 6 mateo
18 4 buddy 1 steve
19 4 buddy 2 joey
20 4 buddy 3 nikolaj
22 4 buddy 5 chester
23 4 buddy 6 mateo
24 5 chester 1 steve
25 5 chester 2 joey
26 5 chester 3 nikolaj
27 5 chester 4 buddy
29 5 chester 6 mateo
30 6 mateo 1 steve
31 6 mateo 2 joey
32 6 mateo 3 nikolaj
33 6 mateo 4 buddy
34 6 mateo 5 chester
generalization
If you want a more generic method, in which you can generate combinations/permutations/etc. and handle more than one combination.
Example with combinations of 2:
from itertools import combinations
pd.concat([df.loc[idx].add_suffix(f'_{i}').reset_index(drop=True)
for i, idx in enumerate(map(list, zip(*combinations(range(len(df)), r=2))), start=1)],
axis=1)
Output:
id_1 name_1 id_2 name_2
0 1 steve 2 joey
1 1 steve 3 nikolaj
2 1 steve 4 buddy
3 1 steve 5 chester
4 1 steve 6 mateo
5 2 joey 3 nikolaj
6 2 joey 4 buddy
7 2 joey 5 chester
8 2 joey 6 mateo
9 3 nikolaj 4 buddy
10 3 nikolaj 5 chester
11 3 nikolaj 6 mateo
12 4 buddy 5 chester
13 4 buddy 6 mateo
14 5 chester 6 mateo
permutations of 3:
from itertools import permutations
print(pd.concat([df.loc[idx].add_suffix(f'_{i}').reset_index(drop=True)
for i, idx in enumerate(map(list, zip(*permutations(range(len(df)), r=3))), start=1)],
axis=1))
Output:
id_1 name_1 id_2 name_2 id_3 name_3
0 1 steve 2 joey 3 nikolaj
1 1 steve 2 joey 4 buddy
2 1 steve 2 joey 5 chester
3 1 steve 2 joey 6 mateo
4 1 steve 3 nikolaj 2 joey
.. ... ... ... ... ... ...
115 6 mateo 4 buddy 5 chester
116 6 mateo 5 chester 1 steve
117 6 mateo 5 chester 2 joey
118 6 mateo 5 chester 3 nikolaj
119 6 mateo 5 chester 4 buddy
[120 rows x 6 columns]
CodePudding user response:
Use cross join with suffixes
parameter and filter by DataFrame.loc
:
df = (df.merge(df, how='cross', suffixes=('','_second'))
.loc[lambda x: x['id'].ne(x['id_second'])])
print (df)
id name id_second name_second
1 1 steve 2 joey
2 1 steve 3 nikolaj
3 1 steve 4 buddy
4 1 steve 5 chester
5 1 steve 6 mateo
6 2 joey 1 steve
8 2 joey 3 nikolaj
9 2 joey 4 buddy
10 2 joey 5 chester
11 2 joey 6 mateo
12 3 nikolaj 1 steve
13 3 nikolaj 2 joey
15 3 nikolaj 4 buddy
16 3 nikolaj 5 chester
17 3 nikolaj 6 mateo
18 4 buddy 1 steve
19 4 buddy 2 joey
20 4 buddy 3 nikolaj
22 4 buddy 5 chester
23 4 buddy 6 mateo
24 5 chester 1 steve
25 5 chester 2 joey
26 5 chester 3 nikolaj
27 5 chester 4 buddy
29 5 chester 6 mateo
30 6 mateo 1 steve
31 6 mateo 2 joey
32 6 mateo 3 nikolaj
33 6 mateo 4 buddy
34 6 mateo 5 chester