Home > front end >  Split sentences by Case change where two words are "stuck" together
Split sentences by Case change where two words are "stuck" together

Time:07-03

I am attempting to clean up the following data which has been extracted from HTML.

Some sentences haven't quite split correctly with the Capitalised word at the start of one sentence "stuck" to the preceding word.

The image below illustrates what I am trying to achieve:

enter image description here

So in essence if there is a sentence like: The boy plays with the ballThe Girl plays with the Console in a row. This would split to:

The boy plays with the ball
The Girl plays with the Console

M code so far with the actual data ( must be run in power BI as uses Html.Table function which is not available in excel).

let
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
    #"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
    #"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
    #"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
    #"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
    #"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
    #"Expanded Text"

CodePudding user response:

Image still looks incorrect (informationOverall is not split) but if you want to split by character transition, you can do so from the ribbon.

enter image description here

  • Related