Disclaimer: this is a simplified example. In the real case I need to compute a heavy cost function while avoiding repetition: a b == b a counts as a duplicate.
I have a dataframe with a string column; in this example I simply concatenate the strings:
import pandas as pd
data = pd.DataFrame({'people':['foo','bar','baz','qux']})
For every name, I add any other name in the dataframe:
results = []
for _, r in data.iterrows():
    all_people = data['people'].tolist()
    person = r['people']
    # concatenate this person with every name, including itself
    results.append({'combination': [person + n for n in all_people]})
After some manipulation:
d = pd.DataFrame(results)
d.explode('combination')
I obtain
| | combination |
|---:|:--------------|
| 0 | foofoo |
| 0 | foobar |
| 0 | foobaz |
| 0 | fooqux |
| 1 | barfoo |
| 1 | barbar |
| 1 | barbaz |
| 1 | barqux |
| 2 | bazfoo |
| 2 | bazbar |
| 2 | bazbaz |
| 2 | bazqux |
| 3 | quxfoo |
| 3 | quxbar |
| 3 | quxbaz |
| 3 | quxqux |
For my logic, foo bar is the same as bar foo, and with a very large dataframe that is a problem. Ideally, once I have computed all combinations using foo, I would remove it from all_people, but that feels ugly. Is there a more Pythonic way?
CodePudding user response:
Use:
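import numpy as np
array = data.to_numpy()
# indices of the upper triangle (diagonal included): each unordered pair appears once
rows, columns = np.triu_indices(len(array))
res = pd.DataFrame(array[rows] + array[columns], columns=["combination"])
print(res)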
Output
combination
0 foofoo
1 foobar
2 foobaz
3 fooqux
4 barbar
5 barbaz
6 barqux
7 bazbaz
8 bazqux
9 quxqux
The above solution combines every pair only once, and the pair selection is done at the NumPy level. Because the elements are Python strings (object dtype), the concatenation itself still runs element by element.
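To make the index trick above more concrete, here is a minimal sketch of what np.triu_indices returns for four names. The `people` array and the `k=1` variant are illustrative additions, not part of the original answer; `k=1` may be useful if self-pairs such as foofoo are not wanted.

```python
import numpy as np

people = np.array(["foo", "bar", "baz", "qux"], dtype=object)

# k=0 (the default) keeps the diagonal, so self-pairs such as foofoo are included.
rows, cols = np.triu_indices(len(people))
with_self = list(people[rows] + people[cols])

# k=1 starts one step above the diagonal, so self-pairs are dropped.
rows1, cols1 = np.triu_indices(len(people), k=1)
without_self = list(people[rows1] + people[cols1])

print(rows, cols)
print(without_self)
```

Each (rows[i], cols[i]) pair satisfies rows[i] <= cols[i], which is why a b and b a are never both produced.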
CodePudding user response:
A more Pythonic method is to use the itertools.combinations_with_replacement generator:
from itertools import combinations_with_replacement
d = pd.DataFrame({
    'combinations': [''.join(c) for c in combinations_with_replacement(data['people'], 2)]
})
print(d)
Output:
combinations
0 foofoo
1 foobar
2 foobaz
3 fooqux
4 barbar
5 barbaz
6 barqux
7 bazbaz
8 bazqux
9 quxqux
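Since the question says the real task is a heavy, symmetric cost function rather than string concatenation, the same generator can feed that function directly. The sketch below assumes a hypothetical `cost(a, b)` placeholder (the real function is not shown in the question) and hypothetical column names `first`, `second` and `cost`.

```python
from itertools import combinations_with_replacement

import pandas as pd

data = pd.DataFrame({'people': ['foo', 'bar', 'baz', 'qux']})


def cost(a, b):
    # Hypothetical stand-in for the expensive function; assumed symmetric,
    # i.e. cost(a, b) == cost(b, a), so each unordered pair is needed once.
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


# Each unordered pair (self-pairs included) is generated exactly once.
pairs = list(combinations_with_replacement(data['people'], 2))

costs = pd.DataFrame(
    [(a, b, cost(a, b)) for a, b in pairs],
    columns=['first', 'second', 'cost'],
)
print(costs)
```

If self-pairs are not needed, itertools.combinations(data['people'], 2) yields the same pairs without them.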
You can also compute the all_people list only once, outside the loop, and inside the loop start slicing at the index of the current row to avoid repetition:
all_people = data['people'].tolist()
results = []
# i is the index label; slicing with it assumes the default 0..n-1 index
for i, r in data.iterrows():
    person = r['people']
    results.append({'combination': [person + n for n in all_people[i:]]})
d = pd.DataFrame(results)
print(d.explode('combination'))
Output:
combination
0 foofoo
0 foobar
0 foobaz
0 fooqux
1 barbar
1 barbaz
1 barqux
2 bazbaz
2 bazqux
3 quxqux
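One caveat with the loop above: iterrows yields index labels, so slicing with i only works for the default 0..n-1 index. A possible variant, sketched here with enumerate and a deliberately non-default index (both are illustrative assumptions, not part of the original answer), uses positions instead:

```python
import pandas as pd

# Non-default index on purpose: with labels 10, 20, ... a slice such as
# all_people[10:] would be empty, so label-based slicing would lose pairs.
data = pd.DataFrame({'people': ['foo', 'bar', 'baz', 'qux']},
                    index=[10, 20, 30, 40])

all_people = data['people'].tolist()
results = []
for pos, person in enumerate(all_people):
    # pos is the position in the list, independent of the dataframe index
    results.append({'combination': [person + n for n in all_people[pos:]]})

d = pd.DataFrame(results).explode('combination')
print(d)
```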