I would like to train my own tokenizer and then use it with a pre-trained model, but when training a new tokenizer there seems to be no way to choose the vocabulary size. As a result, tokenizer.get_vocab()
always returns a dictionary with 30000 elements. How do I change that? Here is what I do:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(['transcripts.raw'], trainer)  # train() doesn't seem to accept any vocabulary-size argument
Answer:
What you can do is use the vocab_size parameter of BpeTrainer, which defaults to 30000:
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=10)
For more information, you can check out the docs.
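As a rough end-to-end sketch (the vocab_size=5000 value and the verification at the end are illustrative additions, not from the original question), the whole training flow then looks like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is passed to the trainer, not to train()
trainer = BpeTrainer(
    vocab_size=5000,  # illustrative target size; pick whatever your corpus needs
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train(['transcripts.raw'], trainer)
print(len(tokenizer.get_vocab()))  # should now be at most ~5000

Note that the special tokens count toward vocab_size, and the trained vocabulary can end up smaller than the target if the corpus is too small to yield that many merges.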