Home > Software design >  Compare rows in a pandas df and generate new columns
Compare rows in a pandas df and generate new columns

Time:01-31

I have this pandas df

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['steve', 'joey', 'nikolaj', 'buddy', 'chester', 'mateo']})

I would like to compare each row in the name column and generate a df that looks as below:

id    name    second_name    second_id
1     steve   joey           2
1     steve   nikolaj        3
1     steve   buddy          4
1     steve   chester        5
1     steve   mateo          6
2     joey    steve.         1
3.    joey.   nikolaj.       3

Basically each unique row in the df will repeat itself everytime comparing with each and every other row in that column.

CodePudding user response:

An easy way is a cross merge with filtering:

out = df.merge(df.add_prefix('second_'), how='cross').query('id != second_id')

Output:

    id     name  second_id second_name
1    1    steve          2        joey
2    1    steve          3     nikolaj
3    1    steve          4       buddy
4    1    steve          5     chester
5    1    steve          6       mateo
6    2     joey          1       steve
8    2     joey          3     nikolaj
9    2     joey          4       buddy
10   2     joey          5     chester
11   2     joey          6       mateo
12   3  nikolaj          1       steve
13   3  nikolaj          2        joey
15   3  nikolaj          4       buddy
16   3  nikolaj          5     chester
17   3  nikolaj          6       mateo
18   4    buddy          1       steve
19   4    buddy          2        joey
20   4    buddy          3     nikolaj
22   4    buddy          5     chester
23   4    buddy          6       mateo
24   5  chester          1       steve
25   5  chester          2        joey
26   5  chester          3     nikolaj
27   5  chester          4       buddy
29   5  chester          6       mateo
30   6    mateo          1       steve
31   6    mateo          2        joey
32   6    mateo          3     nikolaj
33   6    mateo          4       buddy
34   6    mateo          5     chester

generalization

If you want a more generic method, in which you can generate combinations/permutations/etc. and handle more than one combination.

Example with combinations of 2:

from itertools import combinations

pd.concat([df.loc[idx].add_suffix(f'_{i}').reset_index(drop=True)
           for i, idx in enumerate(map(list, zip(*combinations(range(len(df)), r=2))), start=1)],
          axis=1)

Output:

    id_1   name_1  id_2   name_2
0      1    steve     2     joey
1      1    steve     3  nikolaj
2      1    steve     4    buddy
3      1    steve     5  chester
4      1    steve     6    mateo
5      2     joey     3  nikolaj
6      2     joey     4    buddy
7      2     joey     5  chester
8      2     joey     6    mateo
9      3  nikolaj     4    buddy
10     3  nikolaj     5  chester
11     3  nikolaj     6    mateo
12     4    buddy     5  chester
13     4    buddy     6    mateo
14     5  chester     6    mateo

permutations of 3:

from itertools import permutations

print(pd.concat([df.loc[idx].add_suffix(f'_{i}').reset_index(drop=True)
           for i, idx in enumerate(map(list, zip(*permutations(range(len(df)), r=3))), start=1)],
          axis=1))

Output:

     id_1 name_1  id_2   name_2  id_3   name_3
0       1  steve     2     joey     3  nikolaj
1       1  steve     2     joey     4    buddy
2       1  steve     2     joey     5  chester
3       1  steve     2     joey     6    mateo
4       1  steve     3  nikolaj     2     joey
..    ...    ...   ...      ...   ...      ...
115     6  mateo     4    buddy     5  chester
116     6  mateo     5  chester     1    steve
117     6  mateo     5  chester     2     joey
118     6  mateo     5  chester     3  nikolaj
119     6  mateo     5  chester     4    buddy

[120 rows x 6 columns]

CodePudding user response:

Use cross join with suffixes parameter and filter by DataFrame.loc:

df  = (df.merge(df, how='cross', suffixes=('','_second'))
         .loc[lambda x: x['id'].ne(x['id_second'])])

print (df)
    id     name  id_second name_second
1    1    steve          2        joey
2    1    steve          3     nikolaj
3    1    steve          4       buddy
4    1    steve          5     chester
5    1    steve          6       mateo
6    2     joey          1       steve
8    2     joey          3     nikolaj
9    2     joey          4       buddy
10   2     joey          5     chester
11   2     joey          6       mateo
12   3  nikolaj          1       steve
13   3  nikolaj          2        joey
15   3  nikolaj          4       buddy
16   3  nikolaj          5     chester
17   3  nikolaj          6       mateo
18   4    buddy          1       steve
19   4    buddy          2        joey
20   4    buddy          3     nikolaj
22   4    buddy          5     chester
23   4    buddy          6       mateo
24   5  chester          1       steve
25   5  chester          2        joey
26   5  chester          3     nikolaj
27   5  chester          4       buddy
29   5  chester          6       mateo
30   6    mateo          1       steve
31   6    mateo          2        joey
32   6    mateo          3     nikolaj
33   6    mateo          4       buddy
34   6    mateo          5     chester
  • Related