I am attempting to split an example paragraph into sentences using regex in Power Query:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.However, this line wont do it. Qr. Test for Website.COM and Labs.ORG looks good.Creatively not working. t and finished. 9 to start
Into:
Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind? Dr. Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
However, this line wont do it.
Qr.
Test for Website.
COM and Labs.
ORG looks good.
Creatively not working. t and finished.
9 to start
Alternatively (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
can be found
For demonstration purposes I loaded the data directly from Excel. I'm sure you can figure out how to connect your PDF;
Since the JavaScript-based function is a small HTML-script we have to escape the apostrope in the sample text first using a replace function. Otherwise it will clash with the apostrophes used to write the script in the function (see below). If we don't the function will error out/show nothing. Apostrophe will be shown correctly after applying function;
I edited the pattern to catch a full sentence in 1st capture group and for this sample I replaced what is captured with the backreference to this group and a pipe-symbol to visualize the result. Note there is no use of a negative lookbehind nomore since that is not supported in the engine. This resulted in a lengthy pattern which probably does not yet catch all the quirks possible:
\s*((?:\b[MDJS]rs?\.|\d*\.\d |\S \.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.) |[^.?!]) (?:[.?!] |$))
M-Code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Kol", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","'","&apos",Replacer.ReplaceText,{"Kol"}),
#"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Kol], "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d |\\S \\.(?:com|net|org)\\b|[a-z]\\.(?:[a-z]\\.) |[^.?!]) (?:[.?!] |$))", "$1|"))
in
#"Invoked Custom Function"
Used function fnRegexReplace
:
(x,y,z)=>
let
Source = Web.Page(
"<script>var x="&"'"&x&"'"&";var z="&"'"&z&
"'"&";var y=new RegExp('"&y&"','gmi');
var b=x.replace(y,z);document.write(b);</script>")
[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
Source
An online demo of the regular expression.