In NLP, how to tokenize words like "salesman" into "sales" and "man"?


I am trying to correlate two documents: one contains "sale" and the other contains "salesman" and "saleswomen". Is there a method in Python/NLP to split or tokenize "salesman" into "sales" and "man"?

Update: I have to process a large dataset, so adding special cases for individual words would be difficult.

I found a library, splitter: https://github.com/TimKam/compound-word-splitter

import splitter

print(splitter.split('artfactory'))  # ['art', 'factory']
print(splitter.split('salesman'))    # ['salesman'] -- not split

but it works for "artfactory" and not for "salesman": the output is ['art', 'factory'] and ['salesman'].

CodePudding user response:

I think a possible solution for you would be to write a regex that matches the prefix "sales" and then captures everything that comes after it (in your case, "man" or "women"). Take a look at regex and lookahead techniques in Python if you are not familiar with them; a sketch of the idea follows.
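Here is a minimal sketch of that approach, assuming the compounds you care about all start with a known prefix such as "sales" (the pattern and the function name split_compound are illustrative, not part of the original answer):

import re

# Illustrative pattern: capture the known prefix "sales" and whatever follows.
PREFIX_RE = re.compile(r"^(sales)(\w+)$", re.IGNORECASE)

def split_compound(word):
    # Split "sales<rest>" into ["sales", "<rest>"]; leave other words alone.
    match = PREFIX_RE.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]

print(split_compound("salesman"))    # ['sales', 'man']
print(split_compound("saleswomen"))  # ['sales', 'women']
print(split_compound("sale"))        # ['sale'] (no split)

For a large dataset you would extend the prefix alternation (or build it from a list of known prefixes) rather than hard-coding a single word.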

CodePudding user response:

I don't know about NLTK, but you can do this with spaCy. Here's a demonstration, lightly edited, taken straight from their docs.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("a salesman")  # phrase to tokenize
print([w.text for w in doc])  # ['a', 'salesman']

# Add special case rule
special_case = [{ORTH: "sales"}, {ORTH: "man"}]
nlp.tokenizer.add_special_case("salesman", special_case)

# Check new tokenization
print([w.text for w in nlp("a salesman")])  # ['a', 'sales', 'man']

There may be better ways to add rules for a whole family of similar words like "saleswoman", should you have that need, but even adding them in a for-loop would do, as in the sketch below.
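For example, a minimal sketch that registers several special cases at once (the word list here is illustrative; note that spaCy requires the parts of a special case to concatenate back to the original string):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Illustrative compound list; extend it for your dataset.
compounds = {
    "salesman": ["sales", "man"],
    "salesmen": ["sales", "men"],
    "saleswoman": ["sales", "woman"],
    "saleswomen": ["sales", "women"],
}

for word, parts in compounds.items():
    nlp.tokenizer.add_special_case(word, [{ORTH: part} for part in parts])

print([w.text for w in nlp("a salesman and a saleswoman")])
# ['a', 'sales', 'man', 'and', 'a', 'sales', 'woman']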
