R: Split data frame text for each sentence using a regex but ignore middle initials

I want to split text in a data frame at the end of each sentence. I cannot just split the text at a period because there are many middle initials and acronyms inside the text.

I have a regex so I do not split the text at acronyms like u.s., n.y., and so forth. But I still have a problem where the text is split on middle initials. For example, on the "m" in the name "john m. smith".

library(stringr)

pat <- "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s"
text <- "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y."

str_split(text, pattern = pat)

I have reviewed existing solutions, such as this one (R break corpus into sentences), which does not provide a regex or any other code that does what I need.

Advice on how I can stop splitting on middle initials using a regex is appreciated.

The desired output splits these two sentences from each other:

u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm.

and

the felon lived in n.y.

CodePudding user response:

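Splitting only on dots that are preceded by two or more word characters should work. Note that the split consumes the period, so it is dropped from the end of the first sentence, and the second sentence keeps its leading space: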

library(stringr)

pat <- "(?<=\\w{2,100})\\."
text <- "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y."

str_split(text, pattern = pat)

Output:

[[1]]
[1] "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm"
[2] " the felon lived in n.y."
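
If the period should stay attached to its sentence, as in the desired output from the question, one possible variation is to split on the whitespace after the punctuation instead of on the dot itself. The sketch below is not from the answer above: the data frame, its column names (id, text), and the second example sentence are made up for illustration. It assumes stringr is installed and that every sentence ends in ".", "?" or "!".

```r
library(stringr)

# Hypothetical example data frame; the column names are assumptions.
df <- data.frame(
  id = 1:2,
  text = c(
    "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y.",
    "was the gun loaded? the report does not say."
  ),
  stringsAsFactors = FALSE
)

# Split on whitespace that follows ., ? or !, but only when that
# punctuation is preceded by at least two word characters. This skips
# single-letter initials ("m.") and dotted acronyms ("u.s.", "n.y.").
pat <- "(?<=\\w{2}[.?!])\\s+"

# str_split() is vectorised: it returns one character vector per row.
parts <- str_split(df$text, pattern = pat)

# One row per sentence, keeping the id of the source row.
sentences <- data.frame(
  id = rep(df$id, lengths(parts)),
  sentence = unlist(parts),
  stringsAsFactors = FALSE
)

print(sentences)
```

Because the lookbehind only checks for two word characters before the punctuation, longer abbreviations such as "mr." or "inc." would still be treated as sentence ends, so this is a heuristic rather than a full sentence tokenizer.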