Home > Software design >  How to use PowerQuery to randomly select a proportion of rows for each unique value in a specific co
How to use PowerQuery to randomly select a proportion of rows for each unique value in a specific co

Time:11-10

I was doing all this in Python, but I'm leaving the company and I want to leave behind a process that someone even less tech-y than I can manage.

I've written instructions on how to consolidate several hundred Excel worksheets using Power Query (Get & Transform Data) in Excel. I then have them 'close & load' into an Excel worksheet...the data consists of about 9000 rows with >20 columns. One column, called 'Name', = colleague name (there's about 20 unique names).

Now I need to get a random 10% of each colleague's rows. (In Python I used groupby & sample, here.) I can do this in the Excel worksheet by adding a 'helper row' called 'Rand No' that contains the formula =RAND(), and then using this array formula (Control Shift Enter): =IF([@[Rand No]]>=LARGE((--([@Name]=[Name])*[Rand No]),COUNTIF([Name],[@Name])/10),"Randomised",""). But I was just wondering how you could do this randomisation in PowerQuery, since we already had the PQ editor open when we consolidated all those sheets. I'm guessing it would be more complicated than just doing it with an ordinary Excel formula--but am I wrong?

CodePudding user response:

Adding a random number in PQ doesn't work most times without both buffering and indexing. Otherwise you either get the same random number over and over for each row, or they keep changing as you are using them

Here's a basic query to do the top 10% selection using below function

It groups on Name, adds a sorted random number for each row using the function, then takes the top10% and expands.

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Grouped Rows" = Table.Group(Source, {"Name"}, {{"Data", each _, type table}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each chg([Data])),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each Table.FirstN([Custom],Number.IntegerDivide(Table.RowCount([Custom]),10))),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Data", "Custom"}),
// change to be your own expand    
#"Expanded Custom.1" = Table.ExpandTableColumn(#"Removed Columns", "Custom.1", {"Col2", "Col3", "Col4"}, {"Col2", "Col3", "Col4"})
in  #"Expanded Custom.1"

function chg

// function;  name me as chg
(SourceTable as table) =>
 //  based on https://social.technet.microsoft.com/Forums/en-US/af93118d-98c0-4fa7-bf4c-06765ca7c770/random-sample-in-power-query     
let AddedIndex = Table.AddIndexColumn(SourceTable, "Index", 0, 1),
BufferedRandomNumberList = List.Buffer(List.Random(Table.RowCount(AddedIndex))),
AddedCustom = Table.AddColumn(AddedIndex, "Random", each BufferedRandomNumberList{[Index]}),
Table.AddRandomNumberColumn = Table.RemoveColumns(AddedCustom,{"Index"})
in Table.Sort(Table.AddRandomNumberColumn, {"Random"})

CodePudding user response:

Here is a slightly different approach.

  • I create a List of Random Numbers of the same count as the number of table rows
  • Add that List as a column to the original table
  • Then group by Name, sort each subgroup by the random column, and remove all but N% of the rows.

This will usually provide different output for each time the query is refreshed.

In my query, I generated the data, as a list of Names and List of Values, but you can easily read the data from any source.

M Code

let

//read in the table.
// I created a table, but you can obtain it from any source
    Source = Table.FromColumns({
                List.Repeat({"A".."Z"},10), List.Numbers(0,260)},
                type table[Name=text,Value=Int64.Type]),

//Add buffered random number column to the table
    random = List.Buffer(List.Random(Table.RowCount(Source))),
    newTbl = Table.FromColumns(
            Table.ToColumns(Source) & {random},
            type table [Name=text, Value=Int64.Type, Random=number]
    ),

//Group by Name, then extract the top N by sorting and removing (1-N) rows
    N=0.1,

    grp = Table.Group(newTbl,"Name",{{
        "group", each Table.RemoveRows(Table.Sort(_,{"Random",Order.Descending}),0, Table.RowCount(_)*(1-N))
    }}),

//Expand the Value Column
//  You may need to expand more columns -- just don't check Name or Random
    #"Expanded group" = Table.ExpandTableColumn(grp, "group", {"Value"}, {"Value"})
in
    #"Expanded group"
  • Related