Hugging Face - Fine-tuning in TensorFlow with custom datasets

I have been battling with my own implementation on my dataset, using a different transformer model than the tutorial, and I keep getting the error AttributeError: 'NoneType' object has no attribute 'dtype' as soon as training starts. After trying to debug for hours, I went back to the Hugging Face tutorial found here: https://huggingface.co/transformers/v3.2.0/custom_datasets.html. Running that exact code, so I could identify my mistake, leads to the same error.

!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

from pathlib import Path
def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
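
(For reference, the fast tokenizer returns a BatchEncoding that behaves like a dict of lists, which is why dict(train_encodings) works below; a quick check:)

print(list(train_encodings.keys()))       # ['input_ids', 'attention_mask'] for DistilBERT
print(len(train_encodings['input_ids']))  # one encoded sequence per training text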

import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)

My goal is to perform multi-label text classification on my own custom dataset, which unfortunately I cannot share for privacy reasons. If anyone could point out what is wrong with this implementation, it would be highly appreciated.

CodePudding user response:

The problem is the loss argument you are passing here:

model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn

You don't need to pass the loss parameter at all if you want to use the model's built-in loss function. (In recent versions of transformers, model.compute_loss no longer has the signature Keras expects of a loss function, which appears to be what triggers the 'NoneType' dtype error.)

I was able to train the model with your source code by changing the line above to:

model.compile(optimizer=optimizer)

or by passing an explicit Keras loss function; SparseCategoricalCrossentropy(from_logits=True) matches the model's raw logit outputs and integer class labels:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss_fn)
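
With either option, training then runs on the already-batched dataset. Note that batch_size should not also be passed to fit when the input is a tf.data.Dataset, since the dataset already defines the batches:

model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)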

transformers version: 4.20.1
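
Since your stated goal is multi-label classification: SparseCategoricalCrossentropy fits single-label problems, where each example has exactly one integer class id. For multi-label you would typically give the model multi-hot float label vectors and score each label as an independent binary decision. A minimal sketch, with a hypothetical NUM_LABELS standing in for your label count:

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

NUM_LABELS = 5  # hypothetical; set this to the number of labels in your dataset

model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=NUM_LABELS
)

# Each label is an independent yes/no decision, so apply a sigmoid
# binary cross-entropy over the raw logits instead of a softmax.
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5), loss=loss_fn)

The labels in the tf.data datasets would then be multi-hot float vectors, e.g. [1., 0., 1., 0., 0.], rather than single integers.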

Hope it helps.
