I am using Swift 5 and would like to split an academic text into sentences.

I learned about the NaturalLanguage framework, and it works well for the majority of texts. However, it does not handle some patterns correctly, such as "et al.", the page-number abbreviation "p.", and consecutive parenthesized groups like "(...) (...)".

Here is a reproducible example:

import NaturalLanguage

var sentences: [String] = []
var str = "The information was reported by Brown et al. (2000). This should not have been the case (Brown et al., 2001, p. 10). Several other studies corroborate this (i.e., I don't know but something important, etc.) (Brown et al., 2002; Brown et al., 2003). But this is weird given the results of White et al. (2001)."

str.enumerateSubstrings(in: str.startIndex..., options: [.localized, .bySentences]) { (substring, _, _, _) in
    sentences.append(substring ?? "")
}

sentences.forEach {
    print($0)
}


print(sentences)

Expected results:

The information was reported by Brown et al. (2000). 
This should not have been the case (Brown et al., 2001, p. 10). 
Several other studies corroborate this (i.e., I don't know but something important, etc.) (Brown et al., 2002; Brown et al., 2003). 
But this is weird given the results of White et al. (2001).

Actual results:

The information was reported by Brown et al. 
(2000). 
This should not have been the case (Brown et al., 2001, p. 
10). 
Several other studies corroborate this (i.e., I don't know but something important, etc.) 
(Brown et al., 2002; Brown et al., 2003). 
But this is weird given the results of White et al. 
(2001).


["The information was reported by Brown et al. ", "(2000). ", "This should not have been the case (Brown et al., 2001, p. ", "10). ", "Several other studies corroborate this (i.e., I don\'t know but something important, etc.) ", "(Brown et al., 2002; Brown et al., 2003). ", "But this is weird given the results of White et al. ", "(2001)."]

How can I handle these cases? Is there a way to fix them manually, or is there a similar but better package I could use?

Answer:

One possible solution is to escape the problematic abbreviations before splitting. I haven't performance-tested this, and there is probably room for improvement.

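// Map each problematic abbreviation to a placeholder that contains no period.
let escapingElements = [" p." : " $p$", " et al." : " $et al$", " etc." : " $etc$"]

// Escape the abbreviations before splitting (str must be declared with var).
escapingElements.forEach { original, escaped in
    str = str.replacingOccurrences(of: original, with: escaped)
}

// Discard any sentences collected from an earlier, unescaped run.
sentences.removeAll()

str.enumerateSubstrings(in: str.startIndex..., options: [.localized, .bySentences]) { (substring, _, _, _) in
    sentences.append(substring ?? "")
}

// Restore the original abbreviations in every sentence.
escapingElements.forEach { original, escaped in
    sentences = sentences.map {
        $0.replacingOccurrences(of: escaped, with: original)
    }
}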

The idea is to replace each element of concern with a unique value that does not occur in the text, then split the text into sentences as before. At the end, go through each sentence and replace the escaped value with the original. Instead of your own escape sequence, you could use a UUID.
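
The UUID variant could look roughly like the sketch below. It is untested against the full sample text, and the names `protectedAbbreviations`, `escape`, `restore`, and `splitSentences` are hypothetical helpers introduced only for illustration. It assumes the sentence splitter treats a UUID string like an ordinary word. The actual boundaries still come from Foundation's `.bySentences` enumeration, so results may differ between OS versions.

```swift
import Foundation

// Assumption: abbreviations that should never end a sentence.
// The leading space avoids matching the tail of a longer word.
let protectedAbbreviations = [" et al.", " etc.", " p."]

/// Replaces each abbreviation with a unique, period-free token.
/// Returns the escaped text and a token-to-original lookup table.
func escape(_ text: String, protecting abbreviations: [String]) -> (String, [String: String]) {
    var escaped = text
    var table: [String: String] = [:]
    for abbreviation in abbreviations {
        // A UUID string contains only hex digits and hyphens, so it adds no period.
        let token = " " + UUID().uuidString
        table[token] = abbreviation
        escaped = escaped.replacingOccurrences(of: abbreviation, with: token)
    }
    return (escaped, table)
}

/// Puts the original abbreviations back into a piece of escaped text.
func restore(_ text: String, using table: [String: String]) -> String {
    var restored = text
    for (token, abbreviation) in table {
        restored = restored.replacingOccurrences(of: token, with: abbreviation)
    }
    return restored
}

/// Escapes the text, splits it by sentences, and restores each sentence.
func splitSentences(_ text: String) -> [String] {
    let (escaped, table) = escape(text, protecting: protectedAbbreviations)
    var sentences: [String] = []
    escaped.enumerateSubstrings(in: escaped.startIndex..., options: [.localized, .bySentences]) { substring, _, _, _ in
        if let substring = substring {
            sentences.append(restore(substring, using: table))
        }
    }
    return sentences
}
```

Calling `splitSentences(str)` on the string from the question would then replace the three separate steps above. Because the tokens are generated at run time, they cannot clash with text such as a literal `$p$` that might already appear in the input.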