Apple's Natural Language API returns unexpected results-CodePudding

I'm trying to figure out why Apple's Natural Language API returns unexpected results.

What am I doing wrong? Is it a grammar issue?

I have the following four strings, and I want to extract each word's "stem form."

    // text 1 has two "accredited" in a different order
    let text1: String = "accredit accredited accrediting accredited accreditation accredits"
    
    // text 2 has three "accredited" in different order
    let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
    
    // text 3 has "accreditation"
    let text3: String = "accreditation"
    
    // text 4 has "accredited"
    let text4: String = "accredited"

The issue is with the words accreditation and accredited.

The word accreditation never returned the stem. And accredited returns different results based on the words' order in the string, as shown in Text 1 and Text 2 in the attached image.

I've used the code from Apple's documentation

And here is the full code in SwiftUI:

import SwiftUI
import NaturalLanguage

struct ContentView: View {
    
    // text 1 has two "accredited" in a different order
    let text1: String = "accredit accredited accrediting accredited accreditation accredits"
    
    // text 2 has three "accredited" in a different order
    let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
    
    // text 3 has "accreditation"
    let text3: String = "accreditation"
    
    // text 4 has "accredited"
    let text4: String = "accredited"
    
    var body: some View {
        ScrollView {
            VStack {
                
                Text("Text 1").bold()
                tagText(text: text1, scheme: .lemma).padding(.bottom)
                
                Text("Text 2").bold()
                tagText(text: text2, scheme: .lemma).padding(.bottom)
                
                Text("Text 3").bold()
                tagText(text: text3, scheme: .lemma).padding(.bottom)
                
                Text("Text 4").bold()
                tagText(text: text4, scheme: .lemma).padding(.bottom)
                
            }
        }
    }
    
    // MARK: - tagText
    func tagText(text: String, scheme: NLTagScheme) -> some View {
        VStack {
            ForEach(partsOfSpeechTagger(for: text, scheme: scheme)) { word in
                Text(word.description)
            }
        }
    }
    
    // MARK: - partsOfSpeechTagger
    func partsOfSpeechTagger(for text: String, scheme: NLTagScheme) -> [NLPTagResult] {
        
        var listOfTaggedWords: [NLPTagResult] = []
        let tagger = NLTagger(tagSchemes: [scheme])
        tagger.string = text
        
        let range = text.startIndex..<text.endIndex
        let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
        
        tagger.enumerateTags(in: range, unit: .word, scheme: scheme, options: options) { tag, tokenRange in
            
            if let tag = tag {
                let word: String = String(text[tokenRange])
                let result = NLPTagResult(word: word, tag: tag)
                
                //if !word.localizedCaseInsensitiveContains(tag.rawValue) {
                listOfTaggedWords.append(result)
                //}
            }
            return true
        }
        
        return listOfTaggedWords
    }
    
    // MARK: - NLPTagResult
    struct NLPTagResult: Identifiable, Equatable, Hashable, Comparable {
        var id = UUID()
        var word: String
        var tag: NLTag?
        
        var description: String {
            var newString: String = "\(word)"
            
            if let tag = tag {
                newString  = " : \(tag.rawValue)"
            }
            
            return newString
        }
        
        // MARK: - Equatable & Hashable requirements
        static func == (lhs: Self, rhs: Self) -> Bool {
            lhs.id == rhs.id
        }
        
        func hash(into hasher: inout Hasher) {
            hasher.combine(id)
        }
        
        // MARK: - Comparable requirements
        static func <(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {
            lhs.id.uuidString < rhs.id.uuidString
        }
    }
    
}

// MARK: - Previews
struct ContentView_Previews: PreviewProvider {
    static var previews: some View {
        ContentView()
    }
}

Thanks for your help!

CodePudding user response：

As for why the tagger doesn't find "accredit" from "accreditation", this is because the scheme .lemma finds the lemma of words, not actually the stems. See the difference between stem and lemma on Wikipedia.

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production and producing In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed.

The documentation uses the word "stem", but I do think that the lemma is what is intended here, and getting "accreditation" is the expected behaviour. See the Usage section of the Wikipedia article for "Word stem" for more info. The lemma is the dictionary form of a word, and "accreditation" has a dictionary entry, whereas something like "accredited" doesn't. Whatever you call these things, the point is that there are two distinct concepts, and the tagger gets you one of them, but you are expecting the other one.

As for why the order of the words matters, this is because the tagger tries to analyse your words as "natural language", rather than each one individually. Naturally, word order matters. If you use .lexicalClass, you'll see that it thinks the third word in text2 is an adjective, which explains why it doesn't think its dictionary form is "accredit", because adjectives don't conjugate like that. Note that accredited is an adjective in the dictionary. So "is it a grammar issue?" Exactly.