Home > Software design >  Create a dataframe combination and keep unique column values
Create a dataframe combination and keep unique column values

Time:04-07

I am attempting a problem similar to this link here. I was able to get assistance for the first part, but am facing difficulties on the second part of creating the final dataframe.

Assuming a similar dataframe with a few changes:

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'base1':[1,2,3,4,5] })

   Gene  base1
0  ABC1      1            
1  ABC2      2          
2  ABC3      3          
3  ABC4      4       
4  ABC5      5    

I want to obtain the following final dataframe:

                 base1  base2   
'ABC1', 'ABC2'    1         2        
'ABC1', 'ABC3'    1         3        
'ABC1', 'ABC4'    1         4   
'ABC1', 'ABC5'    1         5     
'ABC2', 'ABC3'    2         3         
'ABC2', 'ABC4'    2         4         
'ABC2', 'ABC5'    2         5 
'ABC3', 'ABC4'    3         4       
'ABC3', 'ABC5'    3         5    
'ABC4', 'ABC5'    4         5   

I have been successful in using package itertools to make the combinations as follows:

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5']
list(combinations(mygenes,2))

The goal is to keep the unique values for each 'Gene' in base1, and create a new column base2 when I create the combinations.

CodePudding user response:

You can try with cross merge

out = df.merge(df,how='cross',suffixes = ('_1', '_2')).query('base1_1<base1_2')
Out[50]: 
   Gene_1  base1_1 Gene_2  base1_2
1    ABC1        1   ABC2        2
2    ABC1        1   ABC3        3
3    ABC1        1   ABC4        4
4    ABC1        1   ABC5        5
7    ABC2        2   ABC3        3
8    ABC2        2   ABC4        4
9    ABC2        2   ABC5        5
13   ABC3        3   ABC4        4
14   ABC3        3   ABC5        5
19   ABC4        4   ABC5        5

After set_index

out = out.set_index(['Gene_1','Gene_2'])
Out[52]: 
               base1_1  base1_2
Gene_1 Gene_2                  
ABC1   ABC2          1        2
       ABC3          1        3
       ABC4          1        4
       ABC5          1        5
ABC2   ABC3          2        3
       ABC4          2        4
       ABC5          2        5
ABC3   ABC4          3        4
       ABC5          3        5
ABC4   ABC5          4        5
  • Related