Pandas: compute all combinations between two columns just once

Disclaimer: this is a simplified example. In the real case I need to compute a heavy cost function while avoiding repetition: a + b and b + a count as duplicates.

I have a dataframe with a string column. In this example I simply concatenate the strings:

import pandas as pd
data = pd.DataFrame({'people':['foo','bar','baz','qux']})

For every name, I append every name in the dataframe (including itself):

results = []
for _,r in data.iterrows():
    all_people = data['people'].tolist()
    person = r['people']
    results.append({'combination':[person + n for n in all_people]})

After some manipulation:

d = pd.DataFrame(results)
d.explode('combination')

I obtain

|    | combination   |
|---:|:--------------|
|  0 | foofoo        |
|  0 | foobar        |
|  0 | foobaz        |
|  0 | fooqux        |
|  1 | barfoo        |
|  1 | barbar        |
|  1 | barbaz        |
|  1 | barqux        |
|  2 | bazfoo        |
|  2 | bazbar        |
|  2 | bazbaz        |
|  2 | bazqux        |
|  3 | quxfoo        |
|  3 | quxbar        |
|  3 | quxbaz        |
|  3 | quxqux        |

For my logic, foo + bar is the same as bar + foo, and with a very large dataframe that duplication is a problem. Ideally, once I have computed all combinations using foo, I would remove it from all_people, but that feels ugly. Is there a more Pythonic way?

CodePudding user response：

Use:

import numpy as np

array = data.to_numpy()
rows, columns = np.triu_indices(len(array))
res = pd.DataFrame(array[rows] + array[columns], columns=["combination"])
print(res)

Output

  combination
0      foofoo
1      foobar
2      foobaz
3      fooqux
4      barbar
5      barbaz
6      barqux
7      bazbaz
8      bazqux
9      quxqux

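The above solution combines every unordered pair only once, using the upper-triangle indices returned by np.triu_indices (diagonal included, so foofoo and the other self-pairs are kept). The indexing is done at the NumPy level; because the elements are strings, the array has object dtype, so the addition itself still runs element by element in Python.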
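
To make the index trick a bit more concrete, here is a small sketch of what np.triu_indices returns for four items. The k=1 variant is not part of the answer above; it is shown only as an option in case self-pairs such as foofoo are not wanted.

```python
import numpy as np

n = 4  # number of rows in the example dataframe

# Upper triangle including the diagonal: each unordered pair once, plus self-pairs.
rows, cols = np.triu_indices(n)
pairs = list(zip(rows.tolist(), cols.tolist()))

# k=1 starts one step above the diagonal, so self-pairs are dropped.
rows1, cols1 = np.triu_indices(n, k=1)
strict_pairs = list(zip(rows1.tolist(), cols1.tolist()))

print(pairs)
print(strict_pairs)
```

With n items this gives n * (n + 1) / 2 pairs when the diagonal is included and n * (n - 1) / 2 when it is not, instead of the n * n pairs produced by the nested loop in the question.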

CodePudding user response：

A more Pythonic approach is to use itertools.combinations_with_replacement, which lazily yields each unordered pair once (self-pairs included):

from itertools import combinations_with_replacement
d = pd.DataFrame({'combinations': [''.join(c) for c in
                  combinations_with_replacement(data['people'], 2)]})
print(d)

Output:

  combinations
0       foofoo
1       foobar
2       foobaz
3       fooqux
4       barbar
5       barbaz
6       barqux
7       bazbaz
8       bazqux
9       quxqux
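
Since the real case in the question involves a heavy cost function rather than string concatenation, the same iterator could be used roughly as sketched below. The cost function here is a made-up, symmetric stand-in (it counts shared characters) and the column names a, b and cost are assumptions for illustration only.

```python
from itertools import combinations_with_replacement

import pandas as pd

data = pd.DataFrame({'people': ['foo', 'bar', 'baz', 'qux']})


def cost(a, b):
    # Hypothetical stand-in for the expensive function; symmetric by construction.
    return len(set(a) & set(b))


# Evaluate the cost once per unordered pair and keep both members of the pair.
records = [
    {'a': a, 'b': b, 'cost': cost(a, b)}
    for a, b in combinations_with_replacement(data['people'], 2)
]
pair_costs = pd.DataFrame(records)
print(pair_costs)
```

If a pair of an item with itself is not meaningful for the cost function, itertools.combinations(data['people'], 2) yields the same pairs without the self-pairs.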

You can also build the all_people list only once, outside the loop, and slice it from the position of the current row so that earlier names are skipped. Note that this uses the index label i as a list position, so it assumes the dataframe has the default 0..n-1 index:

all_people = data['people'].tolist()

results = []
for i, r in data.iterrows():
    person = r['people']
    results.append({'combination':[person + n for n in all_people[i:]]})

d = pd.DataFrame(results)
print(d.explode('combination'))

Output:

  combination
0      foofoo
0      foobar
0      foobaz
0      fooqux
1      barbar
1      barbaz
1      barqux
2      bazbaz
2      bazqux
3      quxqux
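
If the dataframe does not have the default index, a variant of the loop that uses enumerate for the position might look like the sketch below. The index values 10 to 40 are invented purely to illustrate the point.

```python
import pandas as pd

# Hypothetical non-default index, to show that labels are not used as positions.
data = pd.DataFrame({'people': ['foo', 'bar', 'baz', 'qux']},
                    index=[10, 20, 30, 40])

all_people = data['people'].tolist()

results = []
for pos, person in enumerate(all_people):
    # Slice from the current position so earlier names are never revisited.
    results.append({'combination': [person + n for n in all_people[pos:]]})

d = pd.DataFrame(results)
d.index = data.index  # restore the original labels before exploding
d = d.explode('combination')
print(d)
```

Iterating over the plain list also avoids calling iterrows, which builds a Series for every row.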