I want to split text in a data frame at the end of each sentence. I cannot just split the text at a period because there are many middle initials and acronyms inside the text.
I have a regex that avoids splitting at acronyms like u.s., n.y., and so forth. But the text is still split on middle initials: for example, after the "m." in the name "john m. smith".
library(stringr)

# Current attempt: split on whitespace that follows "." or "?", unless an acronym precedes it
pat <- "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s"
text <- "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y."
str_split(text, pattern = pat)
I have reviewed existing solutions, such as "R break corpus into sentences", but that one does not provide a regex or any other code that does what I need.
Advice on how I can stop splitting on middle initials using a regex is appreciated.
The desired output splits these two sentences from each other:
u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm.
and
the felon lived in n.y.
CodePudding user response:
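Splitting on dots that are preceded by more than one letter should work, because the dots in initials and acronyms such as "m." and "n.y." each follow a single letter: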
library(stringr)

# Match a dot only when at least two word characters come directly before it
pat <- "(?<=\\w{2,100})\\."
text <- "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y."
str_split(text, pattern = pat)
Output:
[[1]]
[1] "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm"
[2] " the felon lived in n.y."
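Note that this split consumes the period and leaves a leading space on the second piece, whereas the desired output in the question keeps the period. A possible variant, offered as a sketch rather than a general sentence splitter, is to split on the whitespace instead and move the "two word characters, then punctuation" test into the lookbehind. The pattern below is my own adaptation of the idea above; including "?" alongside "." is an assumption carried over from the pattern in the question.

```r
library(stringr)

# Split on whitespace that follows "." or "?", but only when that
# punctuation is itself preceded by at least two word characters.
# The whitespace is consumed, so the punctuation stays with its sentence.
pat <- "(?<=\\w{2}[.?])\\s+"

text <- "u.s. judge john m. smith sentenced mark doe (23, largo) yesterday to 15 years in prison for being a felon in possession of a firearm. the felon lived in n.y."

# str_split() returns a list with one element per input string
sentences <- str_split(text, pattern = pat)[[1]]
print(sentences)
```

On the sample text this should give two elements, the first ending in "firearm." and the second being "the felon lived in n.y.". Because str_split() is vectorised over its input, passing a data frame column returns one character vector per row. The same limitation applies as before: abbreviations of two or more letters (for example "mr." or "inc.") would still trigger a split, and a sentence that ends in an acronym such as "n.y." would not be separated from the one that follows it.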