What is the difference between tokenization and segmentation in NLP? I searched for both terms but didn't really find any difference between them.
CodePudding user response:
Short answer: All tokenization is segmentation, but not all segmentation is tokenization.
Long Answer:
Segmentation is the more generic concept of splitting the input text into pieces. Tokenization is one type of segmentation, carried out according to well-defined criteria.
For example, suppose every input sentence is a compound sentence made up of two sub-sentences. Splitting each one into two independent sentences can be called segmentation (but not tokenization).
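To make that hypothetical concrete, here is a minimal sketch. The function name `split_compound` and the rule that the two sub-sentences are joined by ", and", ", but", ", so" or ";" are my own assumptions for illustration, not a standard algorithm:

```python
import re

def split_compound(sentence: str) -> list[str]:
    """Split a compound sentence into its two sub-sentences.

    Assumption: the halves are joined by ", and" / ", but" / ", so" or ";".
    """
    # maxsplit=1 because the scenario assumes exactly two sub-sentences.
    parts = re.split(r"\s*(?:,\s*(?:and|but|so)\s+|;\s*)", sentence.strip(), maxsplit=1)
    # Drop the trailing full stop and any empty pieces.
    return [p.strip().rstrip(".") for p in parts if p.strip()]

segments = split_compound(
    "The model was trained overnight, and the results arrived in the morning."
)
print(segments)
```

The output is a list of sentence-sized segments. Nothing here maps to a vocabulary or to ids, which is why I would call it segmentation rather than tokenization.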
Tokenization is a form of segmentation performed on the basis of semantic criteria or a token dictionary (e.g. word or sub-word tokenization), mainly with the intention of assigning token ids to the resulting pieces for downstream processing.
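As a rough illustration of the "token dictionary plus ids" part, here is a toy word-level tokenizer. The vocabulary `VOCAB`, the `<unk>` fallback, and the helper names are hypothetical; real tokenizers (word-level or sub-word) ship or learn their own vocabularies:

```python
import re

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"<unk>": 0, "tokenization": 1, "is": 2, "segmentation": 3, ".": 4}

def word_tokenize(text: str) -> list[str]:
    # Well-defined criterion: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text: str) -> list[int]:
    # Map each token to its id; unknown tokens fall back to <unk>.
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in word_tokenize(text)]

print(word_tokenize("Tokenization is segmentation."))
print(encode("Tokenization is segmentation."))
print(encode("Chunking is segmentation."))
```

The point of the sketch is that the pieces are defined by a fixed rule and a dictionary, so each one can be replaced by an id for downstream processing.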