I have the following function, which splits text into sentences using regex. Upon testing, however, these are instances where the regex doesn't quite work, and the text is wrongly split. For example, if the text contains St. bernard, I do not want this sentence to be split on the . of St.
As a nice workaround, I have modified the regex to allow for exceptions to be ignored. Please see
regex:
\s*((?:\b(?:[djms]rs?|flam|liq|St)\.|\b(?:[a-z]\.){2,}|\.\d[\d.]*|\.(?:com|net|org)\b|[^.?!]) (?:[.?!] |$))
(
M code attempting to achieve this results in an error:
Sentences From text:
let
Exceptions = Exceptions,
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text", type text}}),
#"Replaced Value1" = Table.ReplaceValue(#"Changed Type","#(lf)"," ",Replacer.ReplaceText,{"Text"}),
#"Replaced Value" = Table.ReplaceValue(#"Replaced Value1","'","&apos",Replacer.ReplaceText,{"Text"}),
#"Invoked Custom Function" = Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Text], "\s*((?:\b(?:[djms]rs?"&Exceptions&")\.|\b(?:[a-z]\.){2,}|\.\d[\d.]*|\.(?:com|net|org)\b|[^.?!]) (?:[.?!] |$))", "$1|")),
#"Removed Other Columns" = Table.SelectColumns(#"Invoked Custom Function",{"fnRegexReplace"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"fnRegexReplace", Splitter.SplitTextByDelimiter("|", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "fnRegexReplace"),
#"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter", each ([fnRegexReplace] <> ""))
in
#"Filtered Rows"
Exceptions:
let
Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each "Exceptions"),
#"Grouped Rows" = Table.Group(#"Added Custom", {"Custom"}, {{"Exceptions", each Text.Combine([#"Do not split if:"],"|"), type text}})
in
#"Grouped Rows"
fnRegexReplace
(x,y,z)=>
let
y = Text.Replace(y,"\","\\"),
Source = Web.Page(
"<script>var x="&"'"&x&"'"&";var z="&"'"&z&
"'"&";var y=new RegExp('"&y&"','gmi');
var b=x.replace(y,z);document.write(b);</script>")
[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
Source
Error:
Raw text Data:
Highly Flammable Liquid Flam. H223 Liq. H334.
St. Bernard Dog was present.
The MW of gold is 100.1. Solubility is 40mg/L.
Im sure this is an easy fix, but whatever I try, i.e. Record.FromTable{0} etc I get various errors.
If anyone could help me out, that would be great.
Thank you.
CodePudding user response:
This is the problem line fixed.
= Table.AddColumn(#"Replaced Value", "fnRegexReplace", each fnRegexReplace([Text], "\s*((?:\b(?:[djms]rs?"&Exceptions[Exceptions]{0}&")\.|\b(?:[a-z]\.){2,}|\.\d[\d.]*|\.(?:com|net|org)\b|[^.?!]) (?:[.?!] |$))", "$1|"))