How do you get the stem form of a single word token? Here is my code. It works for some words, but not others.
let text = "people" // works
// let text = "geese" // doesn't work
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text
let (tag, range) = tagger.tag(at: text.startIndex, unit: .word, scheme: .lemma)
let stemForm = tag?.rawValue ?? String(text[range])
However, if I lemmatize an entire sentence, the tagger finds the stem form of every word.
let text = "This is text with plurals such as geese, people, and millennia."
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text
var words: [String] = []
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma, options: [.omitWhitespace, .omitPunctuation]) { tag, range in
    let stemForm = tag?.rawValue ?? String(text[range])
    words.append(stemForm)
    return true
}
// this be text with plural such as goose person and millennium
words.joined(separator: " ")
Also, is it possible to reverse the process and find the plural version of a stem word?
CodePudding user response:
If you set the language of the text before tagging it, it works:
tagger.string = text
tagger.setLanguage(.english, range: text.startIndex..<text.endIndex)
let (tag, range) = tagger.tag(at: text.startIndex, unit: .word, scheme: .lemma)
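For single-word lookups, the same fix can be wrapped in a small function. This is only a minimal sketch: the helper name lemma(of:language:) is hypothetical (it is not part of NaturalLanguage), it assumes the input is a single word, and it falls back to the original token when the tagger returns no lemma.

```swift
import NaturalLanguage

// Hypothetical helper: returns the lemma of a single word, or the word itself
// when the tagger has no lemma for it.
func lemma(of word: String, language: NLLanguage = .english) -> String {
    // Avoid asking the tagger for a tag at an index of an empty string.
    guard !word.isEmpty else { return word }

    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = word
    // Pin the language so the tagger does not have to guess it from one token.
    tagger.setLanguage(language, range: word.startIndex..<word.endIndex)

    let (tag, range) = tagger.tag(at: word.startIndex, unit: .word, scheme: .lemma)
    return tag?.rawValue ?? String(word[range])
}

print(lemma(of: "geese"))
```

Defaulting the language parameter to .english is an assumption made for this example; pass a different NLLanguage if your words come from another language.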
Without setting a language, the tagger has to guess it, and the single word "geese" is too little text for it to recognize as English. If you check dominantLanguage without setting the language explicitly, it turns out to be Dutch.
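To see what the tagger guesses for your own input, you can print dominantLanguage yourself. The sketch below also uses NLLanguageRecognizer to list a few candidate languages with their probabilities; the limit of 3 hypotheses is an arbitrary choice for illustration, and the values you get may differ between OS versions.

```swift
import NaturalLanguage

let text = "geese"

let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text
// dominantLanguage is the tagger's own guess; nil means it could not decide.
print(tagger.dominantLanguage?.rawValue ?? "undetermined")

// NLLanguageRecognizer exposes the candidate languages with probabilities.
let recognizer = NLLanguageRecognizer()
recognizer.processString(text)
let hypotheses = recognizer.languageHypotheses(withMaximum: 3)
for (language, probability) in hypotheses {
    print(language.rawValue, probability)
}
```

If English is not the top guess for a short input, that is a sign you should call setLanguage explicitly before tagging.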