Home > Software design >  Extract numbers from text by minimum length of number string using power query
Extract numbers from text by minimum length of number string using power query

Time:10-06

Problem I've been tasked to tidy up some very messy data containing a mix of text and numbers and wish to use power query to separate codes from the data. Fortunately, the codes that need separating consist of only numerical values and appear to be 7 characters in length (let's say 6 or greater).

Below is an example of how I wish to separate the data:

enter image description here

So Far: So Far I have this code:

let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Input ", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","_"," ",Replacer.ReplaceText,{"Input "}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","v"," ",Replacer.ReplaceText,{"Input "}),
#"Added Custom" = Table.AddColumn(#"Replaced Value1", "TextSplit", each Text.Split([#"Input "], " ")),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "ListTransform", each List.Transform([TextSplit], each Text.Select(_,{"0".."9"}))),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "ListSelect", each List.Select([ListTransform], each Text.Length(_)>=5)),
#"Added Custom3" = Table.AddColumn(#"Added Custom2", "TextCombine", each Text.Combine([ListSelect], ", ")),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom3",{"TextSplit", "ListTransform", "ListSelect"})

in #"Removed Columns"

which does appear to solve the problem. However, in cases like 0102646v2.0 this is pulled through as 010264620. In order to get it to work, I have had to introduce steps to replace _ and "v" with " ". Is it not possible for Power Query to recognize that say 0102646v2.0 should be extracted as 0102646?

Data:

Input Values:
3159087 v1.0
3194070 v1.0
#8102368 V3.0 (Shine and ProtectR18)
#8102371 V4.0 (Lemon 12A Degreaser)
Marine (FF3080300 v1.0)
Green Apple (FF3080301 v1.0)
0102646v2.0 (Fresh Cotton)
TDS# 3129801 V1.0 GPA Code#3123402
FF3112964 0.1 FF3145524 0.1_3152912 0.1

Thank you!

CodePudding user response:

Here's a PQ implementation of extracting patterns using Regex and returning them comma separated:

Add this as a custom function. I named it fnRegexExtr

//see http://www.thebiccountant.com/2018/04/25/regex-in-power-bi-and-power-query-in-excel-with-java-script/
// and https://gist.github.com/Hugoberry/4948d96b45d6799c47b4b9fa1b08eadf

let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            var b=x.match(y);
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

You can then use it in your code like this:

let
    Source = Excel.CurrentWorkbook(){[Name="Table10"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Output", 
        each fnRegexExtr([Input], "[0-9]{6,}"))
in
    #"Added Custom"

which returns:

enter image description here

CodePudding user response:

You can extract numbers from a text using the below code:

= Table.AddColumn(#"PreviousStep", "MyNumberColumn", each Text.Select([Input], {"0".."9"}))

this will give you all the numbers in the same order in the string. And if your actual numbers are exactly 7 characters in length, you should be able to se Excel formula to extract that as well.

Or if you prefer splitting the text using Power Query, you can

  1. split the column using number of characters and then (Home > Split Column (drop-down) > By Number of Characters)
  2. merge the columns using a delimiter (Select columns > Go to Add Column tab > Merge Columns)

Note: you need to edit the PreviousStep and Input based on your query

  • Related