How to set vocabulary size in python tokenizers library?


I would like to train my own tokenizer and then use it with a pre-trained model; however, when training a new tokenizer there seems to be no way to choose the vocabulary size. When I call tokenizer.get_vocab() it always returns a dictionary with 30000 elements. How do I change that? Here is what I do:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer) # Here there are no additional arguments for some reason

CodePudding user response:

You can use the vocab_size parameter of BpeTrainer, which defaults to 30000:

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=10)

For more information, you can check out the docs.
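
As a fuller illustration, here is a minimal end-to-end sketch (assuming the same transcripts.raw file from the question and an arbitrary target size of 5000) that sets the vocabulary size on the trainer and then checks the result:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build the tokenizer as in the question
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The vocabulary size is configured on the trainer, not on train()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=5000,  # example value, pick whatever your model needs
)

tokenizer.train(["transcripts.raw"], trainer)

# Prints at most 5000 (fewer if the corpus is too small to fill the vocabulary)
print(len(tokenizer.get_vocab()))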
