Is there a function in tensorflow.keras models similar to partial_fit in sklearn's MLPClassifier?


I'm new to machine learning and am trying to create a keystroke biometrics program using the benchmark keystroke biometric dataset (https://www.cs.cmu.edu/~keystroke/DSL-StrongPasswordData.csv). My goal is to first train the model on the users currently available, and then continue training the same model so it can also predict new users.

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split

data = pd.read_csv('keystroke.csv')

y = pd.get_dummies(data, columns=['subject']).loc[:, "subject_s002":]
X = data.loc[:, "H.period":"H.Return"]

# z-score normalise each timing feature
for col in X.columns:
    X[col] = (X[col] - X[col].mean()) / X[col].std()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = tf.keras.models.Sequential()
model.add(Dense(200, input_shape=(31,), activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(Dense(51, activation='exponential', use_bias=True)) 
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The problem is that I have to set the number of nodes in the output layer equal to the number of labels in my dataset. I start with 51 users, but I hope to be able to add more users and keep training the same model. To do that, I currently have to pop the output layer, create a new one sized for the new number of users, and then retrain the model on the entire dataset, which is turning out to be expensive in terms of time.
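Concretely, the replacement step I mean looks roughly like this (new_num_users, X_all and y_all are placeholders for the grown label count and the combined old + new data):

# pop the old output layer, add a wider one, then retrain on everything
model.pop()
model.add(Dense(new_num_users, activation='exponential', use_bias=True))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_all, y_all, epochs=50)  # full retrain on the whole dataset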

I tried something similar with sklearn's MLPClassifier: I divided the dataset into separate sets of 25 and 26 users, called partial_fit 500 times on the first 25 users, and then 500 times on the remaining 26 users. But the score for the first 25 users drops to 0 after I use partial_fit on the remaining 26 users (I used an initial learning rate of 0.0001 and set the learning rate to be adaptive).

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(activation='logistic', alpha=0, max_iter=900, solver='adam',
                    hidden_layer_sizes=(400,), learning_rate_init=0.00001,
                    warm_start=True, shuffle=True, learning_rate='adaptive')
for i in range(500):
  clf.partial_fit(X_train25, y_train25, classes=subjects)

score25 = clf.score(X_test25, y_test25) # 0.7995

for i in range(500):
  clf.partial_fit(X_train26, y_train26)
score25 = clf.score(X_test25, y_test25) # 0.826
score26 = clf.score(X_test26, y_test26) # 0.0

So I'm wondering: is it possible to do what partial_fit does in sklearn in tensorflow.keras? Or is there a way to retain accuracy for the first 25 users in the sklearn model after using partial_fit on the remaining 26 users?

CodePudding user response:

One pragmatic approach is simply to make your Y data (and your final Dense layer) much wider than your current number of users. If you start with 51 users, have, say, 100 columns in your Y data, of which the last 49 are always zero. Your final Dense layer then also has 100 units.

If you train your model on that, it should never predict a user > 51, since no Y rows carry those labels.

Then when user 52 turns up, you just start to populate column 52 and continue to train the same model.
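A minimal sketch of that padding idea (the 100-user ceiling, the softmax output and the epoch count are arbitrary choices for illustration, not part of the answer):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

max_users = 100                                   # assumed ceiling on the number of users

# pad the one-hot labels out to max_users columns; the extra columns stay zero
y_padded = np.zeros((len(y_train), max_users), dtype='float32')
y_padded[:, :y_train.shape[1]] = y_train.values

model = tf.keras.Sequential([
    Dense(200, input_shape=(31,), activation='sigmoid'),
    Dropout(0.3),
    Dense(max_users, activation='softmax'),       # wide output layer from the start
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_padded, epochs=20, batch_size=32)

# when user 52 appears, one-hot their rows into column 51 (0-indexed) of the
# padded labels and keep calling model.fit on the same model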

It's an approach, but I admit it's not elegant. At some point, you will hit a limit and need to retrain the whole thing.

CodePudding user response:

I divided the dataset into separate sets of 25 and 26 users, and then used partial_fit 500 times on the first 25 users, and then on the remaining 26 users. But the score for the first 25 users is falling to 0 after I use partial_fit for the remaining 26 users

This is completely normal. The problem is known as catastrophic forgetting: if you train on new classes without providing examples of the old ones, the old ones are forgotten. To treat a model as a dynamic object that can pick up new knowledge over time, you have to apply techniques of Incremental Learning (IL).

IL is about learning a new task while having access only (for most techniques, at least) to the current task's batch of data. A task is a set of classes to be learned, added on top of those already trained. You want to find a trade-off between learning the new task and preserving prior knowledge.

Lots of techniques fall under the name of incremental learning. One of the easiest approaches, I think, is task-incremental learning.

Basically you train a set of n classes at a time; each such set is called a task. To train a new task you add a new head (an output layer) to the model, specific to that task, so each task has its own head. To do inference you simply feed the test data to the model, and to read off the prediction you need to know the ID of the head to look at (it can be done without the task ID, but that is harder).

To train a task, you add the new head, freeze all the other heads so they are not changed, and usually apply a loss that helps minimize the drift in accuracy on old tasks. One technique you can look at is called Learning without Forgetting (LwF), but there are really lots of techniques that can help you.

With LwF you add a knowledge distillation loss on the old heads. To put it in a little bit of code, what you aim for is something like this:

import tensorflow as tf

def k_dist_loss(logits, labels, T):
    """
    Computes the knowledge distillation loss that constrains the outputs for the original tasks
    to stay similar to those of the original network. logits are the new model's predictions
    (y_pred), labels are the old model's outputs for the same inputs (y_true), and T is a
    temperature.
    """
    logits = tf.cast(logits, dtype='float32')
    labels = tf.cast(labels, dtype='float32')

    outputs = tf.nn.log_softmax(logits / T, axis=1)
    labels = tf.nn.softmax(labels / T, axis=1)
    outputs = tf.reduce_sum(outputs * labels, keepdims=False, axis=1)
    outputs = -tf.reduce_mean(outputs, keepdims=False, axis=0)

    return outputs  # knowledge distillation loss

def lwf_loss_f(labels, logits, heads_labels, T, lambda0):
    """
    Computes a loss that is the sum of two components:
    - 1 the loss for the last task head, currently in training, for example a ce loss
    - 2 a distillation loss that avoids forgetting old tasks, weighted by lambda0.
    The distillation component minimises the distance between the old model's outputs for old
    tasks and the new model's outputs for old tasks. This should preserve accuracy on old tasks
    after training a new task using only new-task data.
    """
    # cross entropy (or another loss) for the head added last; the new-task head is assumed
    # to be the final element of logits
    ce = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits[-1]))

    distillation_loss = 0
    for i in range(len(heads_labels)):
        distillation_loss += k_dist_loss(labels=heads_labels[i], logits=logits[i], T=T)

    loss = ce + lambda0 * distillation_loss
    return loss

@tf.function
def train_step(x, y):
    # Open a GradientTape to record the operations run
    # during the forward pass, which enables auto-differentiation.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # logits is an array of the logits of each task head
        loss = lwf_loss_f(labels=y, logits=logits, heads_labels=heads_labels, T=T, lambda0=lambda0)

    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

    return loss

# example train loop 
for epoch in range(epochs):
    # Iterate over the batches of the dataset.
    for batch in range(num_batch):
        images = x_train[batch * batch_size: (batch + 1) * batch_size]
        labels = y_train[batch * batch_size: (batch + 1) * batch_size]

        heads_labels = prev_model(images, training=False)
        loss = train_step(images, labels)

Here prev_model is the model from the previous step (without the new head); it is used to transfer the knowledge of the old tasks to the new model. Before training a new task, remember to manually add a new output layer to the model for that task, as in the sketch below.
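A minimal sketch of that head-adding step, assuming a functional-API model whose output is a list of per-task logit heads (build_model, trunk_dense and the head_i names are placeholders, and the trunk simply mirrors the question's architecture):

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, Input

def build_model(classes_per_task):
    """Shared trunk plus one logit head per task."""
    inputs = Input(shape=(31,))
    x = Dense(200, activation='sigmoid', name='trunk_dense')(inputs)
    x = Dropout(0.3, name='trunk_drop')(x)
    heads = [Dense(n, name=f'head_{i}')(x) for i, n in enumerate(classes_per_task)]
    return tf.keras.Model(inputs, heads)

# task 0: train on the first 25 users with a plain cross-entropy on head_0
model = build_model([25])
# ... training of task 0 ...

# snapshot the trained model; its outputs become the distillation targets (heads_labels)
prev_model = tf.keras.models.clone_model(model)
prev_model.set_weights(model.get_weights())
prev_model.trainable = False

# task 1: rebuild with an extra head, copy the already trained weights, freeze the old head
model_new = build_model([25, 26])
for name in ['trunk_dense', 'head_0']:
    model_new.get_layer(name).set_weights(model.get_layer(name).get_weights())
model_new.get_layer('head_0').trainable = False
model = model_new
# now run the train loop above, with prev_model providing heads_labels
# (if prev_model has a single head, wrap its output in a list before using it as heads_labels)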
