I'm trying to split a long string into sentences by breaking at periods, question marks, and exclamation marks. The two main problems I'm running into are splits at decimal points and splits inside a person's name (e.g. Frank N. Jackson becomes "Frank N" and " Jackson").
My current Regex is:
str.split(/[\.\!] (?!\d)\s*|\n \s*/)
I'm pretty sure I addressed the decimal concern, but my approach is still splitting up a person's name, which isn't ideal.
I figure that might be kinda tricky, but was wondering if anyone had any suggestions?
CodePudding user response:
If you can use a negative lookbehind, you can just do:
(?<![A-Z])[.!?](?![0-9])
This will yield matches such as
Frank N. Jackson was a guy[.] He stood 6.75 feet tall[.] He was a miner[.]
skipping any of those characters that is followed by a digit or preceded by a capital letter.
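As a quick check of the pattern above, here is a minimal sketch that plugs it into split. The helper name splitOnTerminators and the sample strings are made up for illustration, and it assumes an engine with lookbehind support (for example a recent Node.js or Chromium-based browser). Note that split drops the terminating punctuation, and that the capital-letter rule also suppresses the split after a sentence that ends in an acronym.

```javascript
// Terminator pattern from the answer above: ., ! or ? that is not preceded
// by a capital letter and not followed by a digit.
const terminator = /(?<![A-Z])[.!?](?![0-9])/;

// Hypothetical helper: split on the terminator, trim each piece,
// and drop the empty piece left after the final terminator.
function splitOnTerminators(text) {
  return text
    .split(terminator)
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}

const sample = "Frank N. Jackson was a guy. He stood 6.75 feet tall. He was a miner.";
console.log(splitOnTerminators(sample));
// → [ 'Frank N. Jackson was a guy', 'He stood 6.75 feet tall', 'He was a miner' ]

// Limitation: the period after "NASA" follows a capital letter, so it is skipped.
console.log(splitOnTerminators("He joined NASA. Then he left."));
// → [ 'He joined NASA. Then he left' ]
```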
CodePudding user response:
I have a rudimentary solution.
const sentences = str.match(/([^?.!]|(?<=\d)\.(?=\d)|(?<=[A-Z])\.(?=\s[A-Z]))*([?!.])/g).map(s => s.trim());
Use str.match instead of str.split so that you keep the punctuation mark that ends each sentence.
([^?.!]|(?<=\d)\.(?=\d)|(?<=[A-Z])\.(?=\s[A-Z]))
matches the valid sequence of characters and spaces that form the body of a sentence. It has three alternatives:
[^?.!]
matches any character that is not a question mark, period, or exclamation mark, so none of those can appear before the end of a sentence unless one of the next two alternatives allows it.
(?<=\d)\.(?=\d)
allows a period before the end if the period is sandwiched between two digits.
(?<=[A-Z])\.(?=\s[A-Z])
allows a period before the end if it is preceded by a capital letter and followed by whitespace and a capital letter. The caveat is that a sentence ending in a capital letter is not split from the next one when that next sentence starts with a capital letter (e.g. "I got a B. Then I left." stays as one match).
[?!.]
is what terminates your sentences.
/g
makes the match global so you get all the sentences.
map(s => s.trim())
trims the whitespace around each sentence.
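To see the pieces described above working together, here is a small sketch that wraps the same regex in a function. The name matchSentences, the sample strings, and the `?? []` fallback (for input with no terminator at all, where match returns null) are additions for illustration, not part of the original one-liner.

```javascript
// Same pattern as above: sentence body (with the decimal and initial
// exceptions) followed by one terminating ., ! or ?.
const sentencePattern =
  /([^?.!]|(?<=\d)\.(?=\d)|(?<=[A-Z])\.(?=\s[A-Z]))*([?!.])/g;

// Hypothetical wrapper: returns trimmed sentences, or [] when nothing matches.
function matchSentences(str) {
  return (str.match(sentencePattern) ?? []).map((s) => s.trim());
}

console.log(
  matchSentences("Frank N. Jackson was a guy. He stood 6.75 feet tall! Was he a miner?")
);
// → [ 'Frank N. Jackson was a guy.', 'He stood 6.75 feet tall!', 'Was he a miner?' ]

// The caveat: a sentence ending in a capital letter is glued to the next one.
console.log(matchSentences("I got a B. Then I left."));
// → [ 'I got a B. Then I left.' ]
```

Unlike the split-based approach, each result keeps its terminating punctuation.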
For the caveat I stated, you can try replacing that alternative with
(?<=\s[A-Z])\.(?=\s[A-Z])
This allows a period before the end only if it is preceded by whitespace and a single capital letter, and followed by whitespace and a capital letter. The caveat is that the matching will have trouble with names like "George H.W. Bush", and your sentences still cannot end in a standalone capital letter. You can try modifying the regex again to resolve these new caveats, but that will only introduce new ones. At some point you need to settle on something and tolerate certain risks given the data you're dealing with.
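A short sketch of that trade-off, using the stricter lookbehind (whitespace plus a single capital letter before the period). The function name matchSentencesStrict and the sample strings are invented for illustration.

```javascript
// Variant with the stricter initial rule: the period is only kept inside a
// sentence when it follows whitespace + one capital letter.
const stricterPattern =
  /([^?.!]|(?<=\d)\.(?=\d)|(?<=\s[A-Z])\.(?=\s[A-Z]))*([?!.])/g;

// Hypothetical wrapper, falling back to [] when there is no match.
function matchSentencesStrict(str) {
  return (str.match(stricterPattern) ?? []).map((s) => s.trim());
}

// Acronyms at the end of a sentence now split correctly...
console.log(matchSentencesStrict("He joined NASA. Then he left."));
// → [ 'He joined NASA.', 'Then he left.' ]

// ...and single initials still work...
console.log(matchSentencesStrict("Frank N. Jackson was a guy. He was a miner."));
// → [ 'Frank N. Jackson was a guy.', 'He was a miner.' ]

// ...but back-to-back initials without a space are broken apart.
console.log(matchSentencesStrict("George H.W. Bush was president."));
// → [ 'George H.', 'W.', 'Bush was president.' ]
```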
My final suggestion would be:
const sentences = (str.match(/([^?.!]|(?<=\d)\.(?=\d)|(?<=\s[A-Z])\.(?=\s[A-Z])|(?<=Adm|Atty|Capt|Col|Dr|Engr|Gen|Lt|Maj|Mr|Mrs|Ms|Msgr|Prof|Rev|Sgt)\.(?=\s[A-Z]))*([?!.])/g) ?? []).map(s => s.trim());
Add more abbreviated titles if you want. Be wary of adding "Jr" or "Sr" though, as those could be found at the end of sentences.
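If the title list is likely to grow, one option is to build the regex from an array instead of editing the literal. This is only a sketch: buildSentencePattern, DEFAULT_TITLES, and matchWithTitles are hypothetical names, and it assumes every title consists of plain letters (nothing that needs regex escaping). It also shows why adding "Jr" is risky.

```javascript
// Title abbreviations from the suggestion above.
const DEFAULT_TITLES = [
  "Adm", "Atty", "Capt", "Col", "Dr", "Engr", "Gen", "Lt",
  "Maj", "Mr", "Mrs", "Ms", "Msgr", "Prof", "Rev", "Sgt",
];

// Hypothetical builder: same pattern as the final suggestion, with the title
// alternation filled in from the array. Assumes titles are letters only.
function buildSentencePattern(titles) {
  const alternation = titles.join("|");
  return new RegExp(
    String.raw`([^?.!]|(?<=\d)\.(?=\d)|(?<=\s[A-Z])\.(?=\s[A-Z])|(?<=${alternation})\.(?=\s[A-Z]))*([?!.])`,
    "g"
  );
}

// Hypothetical wrapper, falling back to [] when there is no match.
function matchWithTitles(str, titles = DEFAULT_TITLES) {
  return (str.match(buildSentencePattern(titles)) ?? []).map((s) => s.trim());
}

console.log(matchWithTitles("Dr. Smith met Mr. Jones. They talked."));
// → [ 'Dr. Smith met Mr. Jones.', 'They talked.' ]

// With "Jr" added, a sentence that ends in "Jr." is no longer split off.
const risky = "He met Sammy Davis Jr. Then he left.";
console.log(matchWithTitles(risky));
// → [ 'He met Sammy Davis Jr.', 'Then he left.' ]
console.log(matchWithTitles(risky, [...DEFAULT_TITLES, "Jr"]));
// → [ 'He met Sammy Davis Jr. Then he left.' ]
```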