Balancing samples on a binary classification sequence problem with sparse positive labels

Time:06-14

I am working on a binary classification problem whose samples are sequences of timesteps. My model should output a prediction, either 0 or 1, for each timestep of each sample. Not sure if it is relevant, but I am using LSTMs followed by a dense layer with a sigmoid activation. I am not obtaining the results I would expect, and I suspect my sampling method is to blame.

Each sample of mine, from the label's perspective, looks something like this:

y = [0,0,0,0,0,0,0,0,...,0,0,0,1,1,1,0,0,0,0,0,0] # usually around 4000 timesteps

I consider a positive sample to be one that contains at least one timestep labeled 1; logically, a negative sample is one whose label vector contains only 0 values. With this in mind, I have 1300 positive samples and 6200 negative samples, so only around 17% of my samples contain at least one label I want to classify as 1, while the rest are entirely 0. I have distributed my samples across training, validation and test sets in the usual way - roughly 0.80 / 0.10 / 0.10 - while ensuring that each set keeps a similar proportion of positive and negative samples.
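For reference, the stratified split described above can be sketched like this, on toy stand-in data (the array shapes, counts and seeds here are illustrative, not taken from my real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in data: 100 label sequences, 20% containing a short positive stretch.
ys = []
for i in range(100):
    y = np.zeros(50, dtype=int)
    if i % 5 == 0:                      # every fifth sample gets a positive stretch
        start = rng.integers(0, 47)
        y[start:start + 3] = 1
    ys.append(y)

# Stratify on a sample-level flag: does the sequence contain any 1?
has_pos = [int(y.any()) for y in ys]
idx = np.arange(len(ys))

# First carve off 20% for val+test, then split that in half -> 0.80/0.10/0.10,
# with the positive/negative proportion preserved in every set.
train_idx, rest_idx = train_test_split(
    idx, test_size=0.20, stratify=np.array(has_pos), random_state=42)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.50,
    stratify=np.array(has_pos)[rest_idx], random_state=42)
```

The key point is stratifying on the sample-level "contains any 1" flag rather than on the per-timestep labels themselves.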

Regarding my training, it seems very unstable despite my having tried multiple configurations of batch sizes, learning rates, etc. My loss and metrics tend to decrease, but they are very unstable and often spike out of nowhere. There is no doubt room for improvement in the tuning/optimization phase, but I suspect the biggest problem lies not there but in the sampling. The model is still able to learn something during training, but it is far from hitting the metric values I would expect.

With this context established, I would like some guidance on the following questions:

  1. This sampling proportion is actually coherent with the scenario the problem derives from: the event associated with label 1 really is very rare relative to its absence. Even though my samples reflect the sparsity that characterizes the occurrence of label 1 - in real life there could be many samples/timesteps with absolutely no 1s - should I still equalize the number of positive and negative samples in my datasets by discarding the excess negative samples? For example, keeping my 1300 positive samples and only about 1300 negative samples?
  2. Some of my samples have very few timesteps. If the normal length is around 4000 timesteps, some samples include only around 5-20. Is it bad to feed such samples to the model? I am worried that they bring no additional learning signal and actually harm the learning process by 'cluttering' the batches used to update parameters - not sure if that makes sense or if it makes no difference at all. If it does make a difference, maybe it would be a good idea to combine this with the logic of point 1 and start by removing the short samples that are also negative?

What are your opinions on this problem? Really curious to hear them!

CodePudding user response:

  1. Yes, a 50 / 50 balanced training set makes sense.

  2. Yes, rejecting each short sequence, even before examining whether it is positive or negative, makes sense.
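A minimal sketch combining both suggestions, assuming the labels are stored as a list of NumPy 0/1 vectors (the helper name, length threshold and seed are illustrative, not from the question):

```python
import numpy as np

def balance_dataset(ys, min_len=100, seed=42):
    """Drop sequences shorter than `min_len`, then undersample the
    negatives so positives and negatives end up 50/50.
    Returns the indices of the kept samples."""
    rng = np.random.default_rng(seed)
    long_enough = [i for i, y in enumerate(ys) if len(y) >= min_len]
    pos = [i for i in long_enough if ys[i].any()]
    neg = [i for i in long_enough if not ys[i].any()]
    # Randomly keep only as many negatives as there are positives.
    kept_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    keep = np.array(pos + list(kept_neg))
    rng.shuffle(keep)
    return keep.tolist()

# Toy usage: 3 positive, 10 negative, plus 5 negatives that are too short.
ys = ([np.ones(200, dtype=int)] * 3
      + [np.zeros(200, dtype=int)] * 10
      + [np.zeros(10, dtype=int)] * 5)
keep = balance_dataset(ys)
```

After filtering and undersampling, `keep` holds equal numbers of positive and negative indices, with all short sequences excluded.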


This question suffers from not being reproducible / testable. https://stackoverflow.com/help/minimal-reproducible-example


Don't fall into this trap:

0, 0, 0, ..., 0, 1, 1, 1, 0, ..., 0, 0, 0
A             B           C             D

Some training approaches might draw A..C and B..D as two "distinct" samples, despite the shared positive stretch in the middle. Avoid doing that.
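One way to avoid that trap is to cut each long sequence into non-overlapping windows, i.e. stride equal to the window length, so that no timestep (and hence no positive stretch) can appear in two different training samples. A sketch assuming NumPy label vectors (the window size is illustrative):

```python
import numpy as np

def make_windows(y, window=1000):
    """Cut a label sequence into non-overlapping windows.

    Because the stride equals the window length, each timestep lands
    in exactly one window, so a positive stretch can never be shared
    between two "distinct" samples.
    """
    return [y[i:i + window] for i in range(0, len(y) - window + 1, window)]

y = np.zeros(4000, dtype=int)
y[2000:2003] = 1          # one short positive stretch, as in the question
wins = make_windows(y)
```

Here the 4000-step sequence yields four windows, and the positive stretch ends up in exactly one of them.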

Given that you designed this as an LSTM solution, training on sample B..D might not be very helpful: the hidden state has had little time to evolve before the positives appear. Consider constraining your positive samples so that the initial N% of each sequence is always negative.
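That constraint is easy to check per sample. A sketch, assuming NumPy label vectors (the function name and the 25% warm-up fraction are illustrative choices, not prescribed values):

```python
import numpy as np

def warmup_is_clean(y, frac=0.25):
    """Return True if the first `frac` of the sequence contains no 1s,
    i.e. the LSTM state gets some warm-up time before any positive label."""
    n = int(len(y) * frac)
    return not np.any(y[:n])

early = np.zeros(100, dtype=int); early[5] = 1    # positive too early: reject
late = np.zeros(100, dtype=int); late[60] = 1     # positive after warm-up: keep
```

Positive samples failing this check can be dropped, or re-windowed so the positive stretch falls later in the sequence.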
