I am using an LSTM for binary classification and initially tried a model with 1 unit in the output (Dense) layer with sigmoid as the activation function. However, it didn't perform well, and I saw a few notebooks that used 2 units in the output layer (the layer immediately after the LSTM) with softmax as the activation function. Is there any advantage to using 2 output units with softmax instead of a single unit with sigmoid for binary classification? I am using binary_crossentropy as the loss function.
CodePudding user response:
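Softmax is not inherently better than sigmoid here. A softmax over two logits gives the same probability as a sigmoid applied to the difference of those logits, so for binary classification the two output layers are mathematically equivalent. The vanishing gradient problem refers to the derivative of sigmoid approaching zero (not one) when its input is large in magnitude; the derivative is at most 0.25, and a 2-unit softmax saturates in the same way. If the softmax model performed better, the reason is more likely something else, such as initialization, the learning rate, or how the labels are encoded. Note that a 2-unit softmax output expects one-hot labels and is normally paired with categorical_crossentropy, while a single sigmoid unit is paired with binary_crossentropy.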
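The relationship between the two activations can be checked numerically. The following is a minimal sketch in plain Python, not taken from the question's model: the helper names `sigmoid`, `softmax2`, and `sigmoid_grad` and the sample logits are made up for illustration. It compares the class-1 probability from a 2-unit softmax with a sigmoid applied to the logit difference, and prints the sigmoid derivative at a small and a large input.

```python
import math


def sigmoid(z):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z0, z1):
    """Softmax over two logits; returns (p0, p1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0 = math.exp(z0 - m)
    e1 = math.exp(z1 - m)
    total = e0 + e1
    return e0 / total, e1 / total


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Hypothetical logits, chosen only for illustration.
logit_pairs = [(-2.0, 1.5), (0.3, 0.3), (4.0, -1.0)]

for z0, z1 in logit_pairs:
    p_softmax = softmax2(z0, z1)[1]   # class-1 probability from softmax
    p_sigmoid = sigmoid(z1 - z0)      # sigmoid of the logit difference
    print(f"softmax p1={p_softmax:.6f}  sigmoid(z1 - z0)={p_sigmoid:.6f}")

# The sigmoid derivative peaks at z = 0 and shrinks toward zero as |z| grows.
print("sigmoid'(0)  =", sigmoid_grad(0.0))
print("sigmoid'(10) =", sigmoid_grad(10.0))
```

The two probability columns agree for every pair, which suggests that any accuracy gap between the two setups comes from training details rather than from the activation itself.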