I am using a CNN for binary classification of DNA sequences, but no matter how I restructure my data or network, I cannot get a 2D binary-classification CNN to work. I can one hot encode the labels and use a 2-neuron softmax dense layer with a binary classification loss function, but that only hovers around 50% accuracy, never mind that softmax paired with a binary loss is the wrong combination of activation and loss in the first place.
The data is 5000 DNA sequences (split into 4500 train/500 validate), each 1000 nucleotides long, that are tokenized and one hot encoded to be a 4x1000 matrix (A, T, C, G). The labels are just 0/1 to denote if they have a particular motif or not.
# Returns Pandas dataframe of names, sequences, and labels that I generated
totalSeqs = GenSeqs()
# Splitting data and labels into train/validation sets
x_tr, x_val, y_tr, y_val = train_test_split(totalSeqs.Sequences.tolist(), totalSeqs.Labels.tolist(), test_size = 0.1)
x_tr, x_val, y_tr, y_val = np.array(x_tr), np.array(x_val), np.array(y_tr), np.array(y_val)
#Tokenizing sequences
tk = Tokenizer(num_words=None, char_level=True)
tk.fit_on_texts(x_tr)
tokenTrain = tk.texts_to_sequences(x_tr)
# One hot encoding tokenized sequences
oneHotTrain = OneHot(tokenTrain)
# Resizing to fit Conv2D and making sure there aren't any array/list conflicts
# Saw someone else had this issue, so I went overboard on preventing it
oneHotTrain = np.array(oneHotTrain).reshape(-1, 4500, 1000, 4)
for x in oneHotTrain:
    x = np.array(x)
    for i in x:
        i = np.array(i)
        for j in i:
            j = np.array(j)
print(oneHotTrain.shape)
trainLabels = np.array(y_tr).reshape(-1, 4500, 1)
for x in trainLabels:
    x = np.array(x)
    for i in x:
        i = np.array(i)
        for j in i:
            j = np.array(j)
print(trainLabels.shape)
This all outputs the shapes (1, 4500, 1000, 4) for the sequences and (1, 4500, 1) for the labels. From my understanding, these are the correct shapes, but it's hard to get exact information on label shapes.
From here, I create the CNN:
model = Sequential()
model.add(Conv2D(32, 4, activation='relu', input_shape = (4500, 1000, 4)))
model.add(MaxPooling2D(2))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
final = model.fit(oneHotTrain, trainLabels, batch_size = 100, epochs = 3, verbose = 1)
For reference, here is the one hot encoding function I use:
def OneHot(data):
    num_classes = 4
    new_data = []
    for x in data:
        class_vector = np.array(x)
        # One extra axis of size num_classes for the one hot vectors
        categorical = np.zeros(class_vector.shape + (num_classes,))
        for c in range(1, 5):
            categorical[np.where(class_vector == c)] = np.array([1.0 if i == c else 0.0 for i in range(1, 5)])
        new_data.append(categorical)
    return new_data
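As an aside, the per-class loop can be replaced with NumPy fancy indexing on an identity matrix; a minimal sketch, assuming the Tokenizer maps the four nucleotides to the integer tokens 1-4 (the helper name is mine, not from the question):

```python
import numpy as np

def one_hot_fast(token_seqs, num_classes=4):
    """One hot encode integer tokens 1..num_classes by row-indexing an identity matrix."""
    arr = np.asarray(token_seqs)         # shape (n_seqs, seq_len)
    return np.eye(num_classes)[arr - 1]  # shape (n_seqs, seq_len, num_classes)

# Tokens for two short example "sequences"
encoded = one_hot_fast([[1, 2, 3, 4], [4, 3, 2, 1]])
print(encoded.shape)  # (2, 4, 4)
```

Token 1 becomes row 0 of the identity matrix ([1, 0, 0, 0]), token 2 becomes row 1, and so on, with no Python-level loops.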
Its output turns out fine, and the function I use to generate the "DNA" only creates sequences that are 1000 characters long and made of A/T/C/G. I've verified all of this by outputting information from the Tokenizer, checking lengths, etc., and the final one hot matrices turn out fine, so I don't think the issue is there or in the one hot function itself.
My assumption is that the error lies somewhere in the CNN architecture/parameters or in the data/label shapes, but I could be missing something else entirely. Any suggestions?
CodePudding user response:
I found out the issue, which turned out to be a string of issues...
- The input_shape for the first Conv2D layer needs 3 dimensions, with the first two being the length/width and the last being the depth. Since I'm dealing with a single text sequence, my input is (1000, 4, 1). If you were dealing with color images, the final value would instead be 3, to account for the color channels.
- Once that was set up, I got errors about both the expected ndims for the Conv2D layer and negative dimensions. First, I reshaped my data to (4500, 1000, 4, 1), which worked and which I assume reflects the breakdown of each level of the data, though I haven't quite got a clear understanding of that yet.
- Finally, the negative dimensions were coming from the kernel sizes in the Conv2D and MaxPooling2D layers. Since I had them set to 4 and 2 respectively, that meant (4, 4) and (2, 2) windows, which tried to apply 4x4 and 2x2 squares to individual 4x1 portions of the sequences (because they were one hot encoded). To fix this, I just changed the first Conv2D and MaxPooling2D kernels to (4, 1) and (2, 1).
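To see why the square kernels produced negative dimensions, you can trace just the width axis (the depth-4 one hot axis) through the layers; a quick sketch of the shape arithmetic for 'valid' padding with stride 1, independent of any Keras version:

```python
def conv_out(size, kernel):
    """Spatial output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Spatial output size of non-overlapping max pooling."""
    return size // pool

# Original square kernels on the width axis, which starts at 4:
w = conv_out(4, 4)   # Conv2D kernel (4, 4): 4 -> 1
w = pool_out(w, 2)   # MaxPooling2D (2, 2): 1 -> 0
print(w)             # 0, so the next Conv2D(3, 3) would need a negative size

# Fixed kernels, which leave the width axis alone at first:
w = conv_out(4, 1)   # Conv2D kernel (4, 1): 4 -> 4
w = pool_out(w, 1)   # MaxPooling2D (2, 1): 4 -> 4
w = conv_out(w, 3)   # second Conv2D (3, 3): 4 -> 2
w = pool_out(w, 2)   # second MaxPooling2D (2, 2): 2 -> 1
print(w)             # 1: every layer still has room
```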
After doing all that, it worked fine and I'm finally getting good results. This was just a toy example that I created to understand why my research CNN wasn't working, so once I was able to figure this out, I got the research CNN pumping out results. Feels good.
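For reference, the whole fix condenses to something like the following sketch (assuming TensorFlow/Keras 2.x; the layer widths are the ones from the question, and data loading/training is omitted):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Inputs reshaped to (n_samples, height, width, channels) = (4500, 1000, 4, 1)
model = Sequential([
    # (4, 1) kernel spans 4 nucleotides along the sequence, 1 along the one hot axis
    Conv2D(32, (4, 1), activation='relu', input_shape=(1000, 4, 1)),
    MaxPooling2D((2, 1)),          # pool only along the sequence axis
    Conv2D(64, 3, activation='relu'),
    MaxPooling2D(2),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),  # single sigmoid unit for 0/1 labels
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.output_shape)  # (None, 1)
```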