Home > front end >  Discrepancy between results reported by TensorFlow model.evaluate and model.predict
Discrepancy between results reported by TensorFlow model.evaluate and model.predict

Time:08-02

I've been back and forth with this for ages, but without being able to find a solution so far anywhere. So, I have a HuggingFace model ('bert-base-cased') that I'm using with TensorFlow and a custom dataset. I've: (1) tokenized my data (2) split the data; (3) converted the data to TF dataset format; (4) instantiated, compiled and fit the model.

During training, it behaves as you'd expect: training and validation accuracy go up. But when I evaluate the model on the test dataset using TF's model.evaluate and model.predict, the results are very different. The accuracy as reported by model.evaluate is higher (and more or less in line with the validation accuracy); the accuracy as reported by model.predict is about 10% lower. (Maybe it's just a coincidence, but it's similar to the reported training accuracy after the single epoch of fine-tuning.)

Can anyone figure out what's causing this? I include snippets of my code below.

# tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased",use_fast=False)

def tokenize_function(examples):
  return tokenizer(examples['text'], padding = "max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# splitting dataset
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize,stratify_by_column='class')
valid_test = train_testvalid['test'].train_test_split(test_size=0.5,stratify_by_column='class')

# renaming each of the datasets for convenience
train_set = train_testvalid['train']
val_set = valid_test['train']
test_set = valid_test['test']

# converting the tokenized datasets to TensorFlow datasets
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8)
tf_validation_dataset = val_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)
tf_test_dataset = test_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)

# loading tensorflow model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=1)

# compiling the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=tf.metrics.BinaryAccuracy())

# fitting model
history = model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=1)

# Evaluating the model on the test data using `evaluate`
results = model.evaluate(x=tf_test_dataset,verbose=2) # reports binary_accuracy: 0.9152

# first attempt at using model.predict method
hits = 0
misses = 0
for x, y in tf_test_dataset:
  logits = tf.keras.backend.get_value(model(x, training=False).logits)
  labels = tf.keras.backend.get_value(y)
  for i in range(len(logits)):
    if logits[i][0] < 0:
      z = 0
    else:
      z = 1
    if z == labels[i]:
      hits  = 1
    else:
      misses  = 1
print(hits/(hits misses)) # reports binary_accuracy: 0.8187

# second attempt at using model.predict method
modelPredictions = model.predict(tf_test_dataset).logits
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
hits = 0
misses = 0
for i in range(len(modelPredictions)):
  if modelPredictions[i][0] >= 0:
    z = 1
  else:
    z = 0
  if z == testDataLabels[i]:
    hits  = 1
  else:
    misses  = 1

print(hits/(hits misses)) # reports binary_accuracy: 0.8187

Things I've tried include:

  1. different loss functions (it's a binary classification problem with the label column of the dataset filled with either a zero or a one for each row);

  2. different ways of unpacking the test dataset and feeding it to model.predict;

  3. altering the 'num_labels' parameter between 1 and 2.

CodePudding user response:

This behavior is completely normal. It is correct that there is some difference between evaluate and predict. This is because the two functions work by two different mechanisms. model.predict() simply returns the final output of the model (this are the actual predictions). While model.evaluate() predicts the output from the model and computes the loss and metrics specified, in your case tf.metrics.BinaryAccuracy(), and returns this computed metric (the result of the metric is the output).

So the difference is that model.predict() uses only the prediction y_pred from the model, while model.evaluate() not only uses y_pred but also y_true, the ground truth.

These readings might come in handy to help you better understand how the two operate:

Also your script seems okay to me. I usually stick to predict when I have to "evaluate" my final model, and compute the accuracy on my own.

CodePudding user response:

I fixed the problem by changing the num_labels parameter to two and the loss function to sparse categorical cross entropy. (I then had to change my model.predict loop by taking the argmax of the two logits produced by the model.)

  • Related