I am attempting a problem similar to this link here. I was able to get assistance for the first part, but am facing difficulties on the second part of creating the final dataframe.
Assuming a similar dataframe with a few changes:
import pandas
import itertools
mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
'base1':[1,2,3,4,5] })
Gene base1
0 ABC1 1
1 ABC2 2
2 ABC3 3
3 ABC4 4
4 ABC5 5
I want to obtain the following final dataframe:
base1 base2
'ABC1', 'ABC2' 1 2
'ABC1', 'ABC3' 1 3
'ABC1', 'ABC4' 1 4
'ABC1', 'ABC5' 1 5
'ABC2', 'ABC3' 2 3
'ABC2', 'ABC4' 2 4
'ABC2', 'ABC5' 2 5
'ABC3', 'ABC4' 3 4
'ABC3', 'ABC5' 3 5
'ABC4', 'ABC5' 4 5
I have been successful in using package itertools
to make the combinations as follows:
mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5']
list(combinations(mygenes,2))
The goal is to keep the unique values for each 'Gene' in base1
, and create a new column base2
when I create the combinations.
CodePudding user response:
You can try with cross
merge
out = df.merge(df,how='cross',suffixes = ('_1', '_2')).query('base1_1<base1_2')
Out[50]:
Gene_1 base1_1 Gene_2 base1_2
1 ABC1 1 ABC2 2
2 ABC1 1 ABC3 3
3 ABC1 1 ABC4 4
4 ABC1 1 ABC5 5
7 ABC2 2 ABC3 3
8 ABC2 2 ABC4 4
9 ABC2 2 ABC5 5
13 ABC3 3 ABC4 4
14 ABC3 3 ABC5 5
19 ABC4 4 ABC5 5
After set_index
out = out.set_index(['Gene_1','Gene_2'])
Out[52]:
base1_1 base1_2
Gene_1 Gene_2
ABC1 ABC2 1 2
ABC3 1 3
ABC4 1 4
ABC5 1 5
ABC2 ABC3 2 3
ABC4 2 4
ABC5 2 5
ABC3 ABC4 3 4
ABC5 3 5
ABC4 ABC5 4 5