How to keep special characters together in word_tokenize?


I have an NLP problem that involves code expressions such as "fn_point->extract.isolate_r". When I use word_tokenize, the "->" operator is split apart, like this: ["fn_point-", ">", "extract.isolate_r"].

I did the following:

from nltk.tokenize import word_tokenize
sentence = "The functional list fn_point->extract.isolate_r of size 32 is not valid"
new_sent = word_tokenize(sentence)
print(new_sent)
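This prints the operator split apart, roughly:

['The', 'functional', 'list', 'fn_point-', '>', 'extract.isolate_r', 'of', 'size', '32', 'is', 'not', 'valid']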

How can I keep "->" as a single token, since it is an operator in the C programming language?

CodePudding user response:

This is a bit ad hoc, but it does the job:

from nltk.tokenize import RegexpTokenizer

# Match runs of word characters and dots, runs of digits, or the "->" operator
tokenizer = RegexpTokenizer(r'[\w\.]+|\d+|->')

tokenizer.tokenize(sentence)  # sentence as defined in the question

OUTPUT

['The', 'functional', 'list', 'fn_point', '->', 'extract.isolate_r', 'of', 'size', '32', 'is', 'not', 'valid']
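
If you would rather keep word_tokenize for everything else, another option is to re-merge the split operator afterwards. A minimal sketch, assuming the tokens come out exactly as reported in the question (merge_arrow is a hypothetical helper, not part of NLTK):

from nltk.tokenize import word_tokenize

def merge_arrow(tokens):
    # Re-join triples like ['fn_point-', '>', 'extract.isolate_r'] into one token
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i].endswith('-') and i + 1 < len(tokens) and tokens[i + 1] == '>':
            rest = tokens[i + 2] if i + 2 < len(tokens) else ''
            merged.append(tokens[i] + '>' + rest)
            i += 3 if rest else 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sentence = "The functional list fn_point->extract.isolate_r of size 32 is not valid"
print(merge_arrow(word_tokenize(sentence)))
# Expected, given the split reported above:
# ['The', 'functional', 'list', 'fn_point->extract.isolate_r', 'of', 'size', '32', 'is', 'not', 'valid']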