I am a beginner of python and I have no idea about how to resolve this problem:
I have DNA sequences of 60 nucleotides save in a csv format, where each nucleotide is replaced by 3 binary indicator variables as follows:
A -> 1 0 0
C -> 0 1 0
G -> 0 0 1
T -> 0 0 0
So I have a matrix of 60*3 columns and ~3000 rows.
I would like to go back and get the original DNA sequences. How can I do this?
I tried the following code but it crashed:
mm_output = pd.DataFrame(columns=['colname_' str(i) for i in range(60)], index=range(X.shape[0]))
for i in range(X.shape[0]):
for j in np.arange(0, 180, 3):
X_sub=X.iloc[i,j:j 3]
j_tmp=int(j/3)
if np.array_equal(X_sub.values,A):
mm_output.iloc[i,j_tmp]='A'
elif np.array_equal(X_sub.values,C):
mm_output.iloc[i,j_tmp]='C'
elif np.array_equal(X_sub.values,G):
mm_output.iloc[i,j_tmp]='G'
elif np.array_equal(X_sub.values,T):
mm_output.iloc[i,j_tmp]='T'
The dataset was collected here: https://www.openml.org/d/40670
Thank you in advance
CodePudding user response:
It looks like you have the data in X
, and based on the file that you linked to, X
has shape (3186, 180). Here's a way you can convert that to an array with shape (3186, 60) containig the characters 'A', 'C', 'G' and 'T'.
First, note that you can convert each triple to an integer index by dotting it (i.e. matrix-multiplying it) with [1, 2, 3]. For example,
In [42]: a = np.array([0, 1, 0])
In [43]: a @ [1, 2, 3] # Convert `a` to the value 2.
Out[43]: 2
In [44]: a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]])
In [45]: a
Out[45]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 0, 0]])
In [46]: a @ [1, 2, 3] # Convert each row of `a` to a single integer.
Out[46]: array([1, 2, 3, 0])
The integers produces by that matrix product can used as an index into an array to convert them to nucleotide characters. E.g.
In [47]: nucs = np.array(['T', 'A', 'C', 'G'])
In [48]: nucs[[1, 2, 3, 0]]
Out[48]: array(['A', 'C', 'G', 'T'], dtype='<U1')
To apply these techniques to your array X
, the array must be reshaped from (3186, 180) to (3186, 60, 3). (The use of -1 in the following tells reshape
to figure out that length based on the other values. You could also put 60 there.)
In [49]: X3 = X.reshape(X.shape[0], -1, 3)
In [50]: X3.shape
Out[50]: (3186, 60, 3)
Now we can apply the matrix multiplication and indexing to X3
. Create the array of indices:
In [51]: index = X3 @ [1, 2, 3]
In [52]: index.shape
Out[52]: (3186, 60)
In [53]: index
Out[53]:
array([[2, 0, 1, ..., 1, 3, 1],
[3, 3, 0, ..., 0, 1, 2],
[3, 3, 2, ..., 2, 3, 3],
...,
[3, 3, 2, ..., 3, 2, 1],
[0, 0, 2, ..., 0, 3, 2],
[1, 3, 1, ..., 0, 3, 3]])
Use index
and nucs
to convert the indices to nucleotide characters:
In [54]: nucleotides = nucs[index]
In [55]: nucleotides.shape
Out[55]: (3186, 60)
In [56]: nucleotides
Out[56]:
array([['C', 'T', 'A', ..., 'A', 'G', 'A'],
['G', 'G', 'T', ..., 'T', 'A', 'C'],
['G', 'G', 'C', ..., 'C', 'G', 'G'],
...,
['G', 'G', 'C', ..., 'G', 'C', 'A'],
['T', 'T', 'C', ..., 'T', 'G', 'C'],
['A', 'G', 'A', ..., 'T', 'G', 'G']], dtype='<U1')
If you want to combine the symbols in each row into single 60 character string, you can use
In [57]: sequences = nucleotides.view('U60')[:, 0]
In [58]: sequences.shape
Out[58]: (3186,)
In [59]: sequences
Out[59]:
array(['CTAGGCTCCAGATAGCCATAGAAGAACCAAACACTTTCTGCGTGTGTGAGAATAATCAGA',
'GGTGTTGCTCTTAGGATGTATCCCCTCAAACCTACCTGGTGGTTCTGTGCCTTCCCCTAC',
'GGCTGGACCGACCACAGCGCGTGCAGTAAGTCGGCCCCCTGCCCCGTCCTGCCCTGCCGG',
...,
'GGCCCAGCTGAGGTCTCTCTCTTCTCGCAGTTTCCATGTGGTACACACTCCCCCGCTGCA',
'TTCAGGGGAAAAAAATATCTTCTTGGCAAGGTAACACACTCTGTAAATGCATGTTCATGC',
'AGACACAGAAGTCCATCTATTACATCACTGGTGCGTTGACTCTGATTGAAGCCTTTTTGG'],
dtype='<U60')
Here's the code without all the intermediate data inspection:
nucs = np.array(['T', 'A', 'C', 'G'])
X3 = X.reshape(X.shape[0], -1, 3)
index = X3 @ [1, 2, 3]
nucleotides = nucs[index]
sequences = nucleotides.view('U60')[:, 0]