Home > Blockchain >  Html.Table Equivalent in Power query Using Regex
Html.Table Equivalent in Power query Using Regex

Time:08-03

So Power Query doesn't have the Html.Table Function that can be found in Power BI.

So can we use regex to convert Html into text and make an equivalent function?

Previous posts state that this shouldn't be done since HTML doesn't follow the same rules as text, however .. needs must. Its also just interesting as a question and if achievable, would prove very useful for scraping difficult pages, directly in excel.

I came across this regex:

enter image description here

Currently, submitting this regex into FnRegexReplace Function results in an error. I don't know if the regex can be read correctly by excel and, if not, if there are any work arounds.

FnRegexReplace: Note: y = Text.Replace(y,"\","\\"), so no need for \\

(x,y,z)=>
let 
   
   y = Text.Replace(y,"\","\\"),
   
   Source = Web.Page(
                     "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                     "'"&";var y=new RegExp('"&y&"','gmi');
                     var b=x.replace(y,z);document.write(b);</script>")
                     [Data]{0}[Children]{0}[Children]{1}[Text]{0}
in 
   Source

M Code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "FnRegexReplace", each FnRegexReplace([Column1], "<([\w\-\/] )(  [\w\-] (=(('[^']*')|(""[^""]*"")))?)* *>", " "))
in
    #"Invoked Custom Function"

HTML DATA:

    </div><!-- SectionHeaderWrapper --><div id="SectionContent"><h3 id="sAdministrativeDataSummary" >Administrative data</h3><h3 id="sWorkersHazardViaInhalationRoute">Workers - Hazard via inhalation route</h3><h4>Systemic effects</h4><h5>Long term exposure</h5><dl ><dt>Hazard assessment conclusion:</dt><dd>no hazard identified</dd></dl></dl></dl></dl></dl></dl><h5>Acute/short term exposure</h5><dl ><dt>Hazard assessment conclusion:</dt><dd>no hazard identified</dd></dl><h6>DNEL related information</h6></dl></dl></dl></dl></dl><h4>Local effects</h4><h5>Long term exposure</h5><dl ><dt>Hazard assessment conclusion:</dt><dd>DNEL (Derived No Effect Level)
Value:</dt><dd><span >0.02</span> mg/m&sup3;
Most sensitive endpoint:</dt><dd>repeated dose toxicity</dd></dl><h6>DNEL related information</h6><dl ><dt>DNEL derivation method:</dt><dd>other: <span >Biocidal Products Regulation guidance for Human Health Risk Assessment (Volume III, Part B, December 2013</span>
Overall assessment factor (AF):</dt><dd >16
Dose descriptor:</dt><dd>NOAEC
Value:</dt><dd><span >0.34</span> mg/m&sup3;
AF for dose response relationship:</dt><dd >1
Justification:</dt><dd >NOAEC defined based on local effects of irritation/corrosion which are considered concentration dependent
AF for differences in duration of exposure:</dt><dd >2
Justification:</dt><dd >NOAEC derived from subchronic study therefore extrapolating to chronic duration
AF for interspecies differences (allometric scaling):</dt><dd >2.5
Justification:</dt><dd >Local effects observed only therefore toxicokinetics do not contribute to interspecies differences
AF for other interspecies differences:</dt><dd >1
Justification:</dt><dd >Local effects observed only therefore toxicokinetics do not contribute to interspecies differences
AF for intraspecies differences:</dt><dd >3.2
Justification:</dt><dd >Local effects observed only therefore toxicokinetics do not contribute to intraspecies differences
AF for the quality of the whole database:</dt><dd >1
Justification:</dt><dd >Hazards well characterised in multiple studies of good reliability 
AF for remaining uncertainties:</dt><dd >1
Justification:</dt><dd >No remaining uncertainties</dd></dl></dl></dl></dl></dl><h5>Acute/short term exposure</h5><dl ><dt>Hazard assessment conclusion:</dt><dd>DNEL (Derived No Effect Level)

CodePudding user response:

One answer so Far: (<[^<>]*>) however I am not sure how well this works for other HTML text

CodePudding user response:

For fun, since recursion seemed boring, the most convoluted way I can think to do this

//single column of HTML text as input into [Column1]
// removes all text betweeen all pairs of < and >
let Source = Csv.Document(File.Contents("C:\Temp\a.txt")),
#"Added Custom1" = Table.AddColumn(Source, "Custom.1", each Text.PositionOfAny([Column1],{"<"},99)),
#"Added Custom" = Table.AddColumn(#"Added Custom1", "Custom", each Text.PositionOfAny([Column1],{">"},99)),
#"Added Custom2" = Table.AddColumn(#"Added Custom", "Custom.2", each List.Zip({[Custom.1],[Custom]})),
#"Added Custom3" = Table.AddColumn(#"Added Custom2", "Custom.3", each List.Positions([Custom.1])),
#"Added Custom4" = Table.AddColumn(#"Added Custom3", "Custom.4", each List.Transform([Custom.3], let x=[Custom.1], y=[Custom] in each {x{_}..y{_}})),
#"Added Custom5" = Table.AddColumn(#"Added Custom4", "Combined", each List.Combine([Custom.4])),
#"Added Custom6" = Table.AddColumn(#"Added Custom5", "TextToList", each Text.ToList([Column1])),
#"Added Custom7" = Table.AddColumn(#"Added Custom6", "Positions", each List.Positions([TextToList])),
#"Added Custom8" = Table.AddColumn(#"Added Custom7", "Custom.8", each List.Transform([Positions], let x=[Combined], y= [TextToList], z=[Custom.1] in each  if List.Contains(z,_) then " "  else if List.Contains(x,_) then null else y{_})),
#"Added Custom9" = Table.AddColumn(#"Added Custom8", "Custom.5", each Text.Combine(List.RemoveNulls([Custom.8]))),
#"Replaced Value" = Table.ReplaceValue(#"Added Custom9","  "," ",Replacer.ReplaceText,{"Custom.5"}),
#"Added Custom10" = Table.AddColumn(#"Replaced Value", "Custom.6", each Text.Trim([Custom.5])),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom10",{"Column1", "Custom.1", "Custom", "Custom.2", "Custom.3", "Custom.4", "Combined", "TextToList", "Positions", "Custom.8", "Custom.5"})
in #"Removed Columns"

then you'd probably go back and replace all HTML entities like

Entities

&amp;
&sup3;
  • Related