Regex function to split paragraphs into sentences for Power Query

Time:07-09

I am attempting to split an example paragraph into sentences using regex in Power Query:

Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.However, this line wont do it. Qr. Test for Website.COM and Labs.ORG looks good.Creatively not working. t and finished. 9 to start

Into:

Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind? Dr. Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.

However, this line wont do it.

Qr.

Test for Website.

COM and Labs.

ORG looks good.

Creatively not working. t and finished.

9 to start


Alternatively, there is this pattern: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
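To see what the lookbehind-based pattern does, here is a minimal sketch that uses it as a split delimiter. It assumes a JavaScript engine that supports lookbehind (a recent Node.js, for example), which is not necessarily the engine available from Power Query. The variable names are only illustrative.

```javascript
// Lookbehind-based pattern quoted above, used as a split delimiter.
// Assumption: the engine supports lookbehind (e.g. a recent Node.js).
const boundary = /(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s/;

// Excerpt of the sample paragraph.
const text = "Did he mind? Dr. Adam Jones Jr. thinks he didn't. In any case, this isn't true...";

// Split at whitespace that follows "." or "?", unless the preceding
// characters look like a two-letter title ("Dr.") or a dotted abbreviation.
const parts = text.split(boundary);
for (const part of parts) {
  console.log(part);
}
```

On this excerpt the whitespace after "Dr." and "Jr." is not treated as a boundary. A three-letter abbreviation such as "Mrs." is not covered by the `[A-Z][a-z]\.` check, so the text would still be split after it.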

  • For demonstration purposes I loaded the data directly from Excel. I'm sure you can figure out how to connect your PDF;

  • The JavaScript-based function is a small HTML script, so we first have to escape the apostrophes in the sample text using a replace function. Otherwise they clash with the apostrophes that delimit the strings in the function's script (see below), and the function errors out or shows nothing. The apostrophes are shown correctly again after applying the function;

  • I edited the pattern to catch a full sentence in the 1st capture group, and for this sample I replaced what is captured with the backreference to this group plus a pipe symbol to visualize the result. Note that the pattern no longer uses a negative lookbehind, since the engine does not support it. This resulted in a lengthy pattern which probably does not yet catch all possible quirks:

    \s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))
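The replace-and-mark idea can be tried outside Power Query. The following is a minimal sketch in plain JavaScript, assuming the pattern is meant to read with `+` quantifiers throughout and is run with the same `gmi` flags as in the function further down. The variable names are only illustrative.

```javascript
// Capture-group pattern (no lookbehind). Assumption: "+" quantifiers
// throughout, and the flags g, m, i as used by fnRegexReplace.
const pattern = /\s*((?:\b[MDJS]rs?\.|\d*\.\d+|\S+\.(?:com|net|org)\b|[a-z]\.(?:[a-z]\.)+|[^.?!])+(?:[.?!]+|$))/gmi;

// Excerpt of the sample paragraph.
const sample = "Mr. and Mrs. Smith bought cheapsite.com for 1.5 million dollars, " +
  "i.e. he paid a lot for it. Did he mind? Dr. Adam Jones Jr. thinks he didn't.";

// Replace each match with its first capture group followed by a pipe.
// Leading whitespace is matched outside the group, so it is dropped.
const marked = sample.replace(pattern, "$1|");

// Splitting on the pipe then yields one entry per detected sentence.
const sentences = marked.split("|").filter((s) => s.length > 0);
for (const sentence of sentences) {
  console.log(sentence);
}
```

Each alternative in the inner group consumes one "safe" unit (a title, a decimal number, a domain name, a dotted abbreviation, or any single non-terminator character), so a period only ends a sentence when none of the earlier alternatives has claimed it.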
    

M-Code:

let
    // Load the sample table from the current workbook
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Kol", type text}}),
    // Escape apostrophes so they cannot end the quoted JavaScript string inside fnRegexReplace
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type","'","&apos",Replacer.ReplaceText,{"Kol"}),
    // Backslashes are doubled because the pattern passes through a JavaScript string literal
    #"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Kol], "\\s*((?:\\b[MDJS]rs?\\.|\\d*\\.\\d+|\\S+\\.(?:com|net|org)\\b|[a-z]\\.(?:[a-z]\\.)+|[^.?!])+(?:[.?!]+|$))", "$1|"))
in
    #"Invoked Custom Function"

Used function fnRegexReplace:

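// x = input text, y = regex pattern (backslashes doubled), z = replacement string
(x,y,z)=>
let 
   // Build a small HTML page whose script runs x.replace(y,z), then read the written text back
   Source = Web.Page(
                     "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                     "'"&";var y=new RegExp('"&y&"','gmi');
                     var b=x.replace(y,z);document.write(b);</script>")
                     [Data]{0}[Children]{0}[Children]{1}[Text]{0}
in 
   Source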
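The following sketch mimics, in plain JavaScript, how the function above glues its arguments into script text. It is meant to show why an unescaped apostrophe breaks the script and why the backslashes in the pattern are doubled. `buildScript` and `runScript` are hypothetical helpers, not part of Power Query, and the sketch returns the result instead of calling `document.write`, so no HTML rendering takes place.

```javascript
// Hypothetical helper: concatenates the arguments into script text the same
// way fnRegexReplace does, but returns b instead of calling document.write.
function buildScript(x, y, z) {
  return "var x='" + x + "';var z='" + z + "';" +
    "var y=new RegExp('" + y + "','gmi');" +
    "var b=x.replace(y,z);return b;";
}

// Hypothetical helper: parses and runs the generated script text.
function runScript(x, y, z) {
  return new Function(buildScript(x, y, z))();
}

// The characters \\s*(...) as the M text would deliver them; inside the
// generated string literal each doubled backslash becomes a single one.
const simplePattern = "\\\\s*([^.?!]+[.?!]+)";

const marked = runScript("Did he mind? Yes.", simplePattern, "$1|");
console.log(marked);

// An unescaped apostrophe ends the quoted string early, so parsing fails.
let failed = false;
try {
  runScript("he didn't.", simplePattern, "$1|");
} catch (e) {
  failed = true;
}
console.log(failed);

// With the apostrophe replaced first (same replacement as in the M-code),
// the script text is valid again.
const escaped = runScript("he didn't.".replace(/'/g, "&apos"), simplePattern, "$1|");
console.log(typeof escaped);
```

The simplified pattern here only stands in for the full one; the point is the quoting, not the sentence rules.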

An online demo of the regular expression.
