I have an NLP problem where the text contains code expressions such as "fn_point->extract.isolate_r". When I use word_tokenize, the "->" is split apart like this: ["fn_point-", ">", "extract.isolate_r"].
I did the following:
from nltk.tokenize import word_tokenize
sentence = "The functional list fn_point->extract.isolate_r of size 32 is not valid"
new_sent = word_tokenize(sentence)
print(new_sent)
How can I keep "->" as a single token, since it is the member-access (arrow) operator in the C programming language?
CodePudding user response:
This is a little bit ad-hoc but does the job:
from nltk.tokenize import RegexpTokenizer
# Match runs of word characters and dots, runs of digits, or the "->" operator
tokenizer = RegexpTokenizer(r'[\w\.]+|\d+|\->')
print(tokenizer.tokenize(sentence))
OUTPUT
['The', 'functional', 'list', 'fn_point', '->', 'extract.isolate_r', 'of', 'size', '32', 'is', 'not', 'valid']
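One thing to watch with the pattern above: because "." is inside the character class, a sentence-final period would stay glued to the last word, and any other punctuation is silently dropped. Below is a rough sketch of a variant that uses only the standard `re` module. The pattern and the `tokenize` helper are my own assumptions, not part of NLTK or of the answer above: it tries "->" first, then identifiers with optional dotted parts, then any other single punctuation character.

```python
import re

# Hypothetical pattern (not from the original answer):
#   ->              the C arrow operator, tried first so it stays whole
#   \w+(?:\.\w+)*   words/identifiers, allowing dotted parts like extract.isolate_r
#   [^\w\s]         any other single punctuation character
TOKEN_RE = re.compile(r"->|\w+(?:\.\w+)*|[^\w\s]")

def tokenize(text):
    """Return the non-overlapping matches of TOKEN_RE, left to right."""
    return TOKEN_RE.findall(text)

sentence = "The functional list fn_point->extract.isolate_r of size 32 is not valid."
tokens = tokenize(sentence)
print(tokens)
```

Other multi-character operators (for example "::" or "==") could be added as further alternatives in front of the punctuation fallback, as long as longer operators are listed before shorter ones that share a prefix.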