I have an input pandas DataFrame with two columns: the first is a sequence and the second is an ID (a number between 1 and 1000). I want to get all possible combinations between the sequences that share the same ID.
Input:
sequence          ID
CASSSTGVLLYEQCF   1
CASSSTGVLLYEQYF   1
CAFNAGGTSHGKLTF   2
CAFNAGGTSYGKLTF   2
CAINAGGTSYGKLTF   2
CANSPSPVAGTDTQYF  3
CASSPSPVAGTDTQYF  3
Desired output:
CASSSTGVLLYEQCF CASSSTGVLLYEQYF
CAFNAGGTSHGKLTF CAFNAGGTSYGKLTF
CAFNAGGTSYGKLTF CAINAGGTSYGKLTF
CAINAGGTSYGKLTF CAFNAGGTSHGKLTF
CANSPSPVAGTDTQYF CASSPSPVAGTDTQYF
I have been reading about itertools, but it only gives me all possible combinations without taking the ID into account. Does anyone know how this can be done in Python, or have any tips on where to look?
Answer:
Use a custom lambda function with itertools.combinations, applied per group via GroupBy.apply:
from itertools import combinations

import pandas as pd

# For each ID, build a DataFrame with one row per unordered pair of sequences.
df1 = df.groupby('ID')['sequence'].apply(lambda x: pd.DataFrame(combinations(x, 2),
                                                                columns=['a','b']))
print(df1)
                     a                 b
ID
1  0   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
2  0   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
   1   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
3  0  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
# Drop the inner per-group index level and turn ID back into a regular column.
df1 = df1.droplevel(1).reset_index()
print(df1)
   ID                 a                 b
0   1   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
1   2   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
2   2   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
3   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
4   3  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
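For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and produces the same pairs with a plain list comprehension over the groups instead of GroupBy.apply. This is an alternative way to write it, not the approach shown above, and the names `rows`, `pairs` and `group_id` are only illustrative.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sequence": [
        "CASSSTGVLLYEQCF", "CASSSTGVLLYEQYF",
        "CAFNAGGTSHGKLTF", "CAFNAGGTSYGKLTF", "CAINAGGTSYGKLTF",
        "CANSPSPVAGTDTQYF", "CASSPSPVAGTDTQYF",
    ],
    "ID": [1, 1, 2, 2, 2, 3, 3],
})

# Iterating over a grouped column yields (ID, Series of sequences) pairs.
# combinations(seqs, 2) then gives every unordered pair within that ID.
rows = [
    (group_id, a, b)
    for group_id, seqs in df.groupby("ID")["sequence"]
    for a, b in combinations(seqs, 2)
]

# Hypothetical output frame: one row per pair, with the ID kept alongside.
pairs = pd.DataFrame(rows, columns=["ID", "a", "b"])
print(pairs)
```

Note that combinations returns unordered pairs, so each pair appears once (A with B, but not also B with A). An ID with only one sequence contributes no rows.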