PowerShell Select-Object: Using -Unique with First/Last/Skip/Index

Time:10-15

I'm just curious if I'm missing any documentation, or if there is a different/better way to do this that negates the need for documentation. Maybe I'm the only one trying to use Select-Object to select the -First X unique instances from a set of data.

Based on the testing below, it looks like using Select-Object with the -Unique switch and some type of limiter (First, Last, Skip, Index, etc.) inherently causes the limiter to be applied BEFORE removing duplicates. This doesn't make sense to me conceptually, but also doesn't appear to be documented.

I apologize for the poor example, but consider an array of 20 items with each item appearing twice:

PS > $array = @() ; 1..10 | % { $array += $_ ; $array += $_ }
PS > $array -Join ','
1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10  ##Displaying the array on a single comma separated line

Let's say that someone gives you $array, but you can only handle a maximum input of 5 objects. To filter down what you're given, you might be tempted to use Select-Object. At first you end up with 5 objects, but there are duplicates, so you add the -Unique switch, only to realize that the output still isn't quite right.

PS > ($array | Select-Object -First 5) -Join ','
1,1,2,2,3  ##5 objects as expected, but with duplicates
PS > ($array | Select-Object -Unique -First 5) -Join ','
1,2,3  ##No duplicates, but less than the expected 5 objects...

To get the outcome I was expecting, I'd need Select-Object to remove the duplicates before returning the final set of objects. There is nothing wrong with having to know this, but it seems strange to me that Select-Object uses this order of operations, and that nothing documents the fact that the -Unique switch is applied at the end of the cmdlet.

PS > ($array | Select-Object -Unique | Select-Object -First 5) -Join ','
1,2,3,4,5  ##This is my expected outcome, 5 objects returned without any duplicates

Answer:

Indeed, the -First / -Last / -Skip / -Index / -SkipIndex / -SkipLast parameters apply to the original input first, and -Unique is applied to the resulting output.
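
The same ordering can be observed with limiters other than -First. The following is a small sketch, reusing the 20-element array from the question (rebuilt here under the same name, $array); the variable names $lastUnique and $skipUnique are only illustrative:

```powershell
# Rebuild the question's array: 1,1,2,2,...,10,10 (20 elements).
$array = 1..10 | ForEach-Object { $_; $_ }

# -Last 5 first narrows the input to 8,9,9,10,10;
# -Unique is then applied to those 5 objects only.
$lastUnique = $array | Select-Object -Unique -Last 5
$lastUnique -join ','    # 8,9,10

# -Skip 15 first drops the first 15 objects, which also leaves 8,9,9,10,10;
# -Unique is then applied to what remains.
$skipUnique = $array | Select-Object -Unique -Skip 15
$skipUnique -join ','    # 8,9,10
```

In both cases fewer than 5 objects come back, because the duplicates are removed only after the limiter has already picked its objects.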

The simple workaround is to use two Select-Object calls: one that finds the unique objects, and another that selects the desired number from among the unique ones:

PS> 1, 1, 2, 3 | Select-Object -Unique | Select-Object -First 2
1
2

Note:

  • While the extra Select-Object call does add processing overhead, the command overall has the potential to process only as many input objects as needed, i.e. to stop processing once the desired number of unique objects has been found.

  • However, as of PowerShell 7.2, it seems that Select-Object -Unique is implemented inefficiently and unexpectedly collects all input first before producing output, even though there's no conceptual reason to do so: it should be able to produce streaming output, i.e. to - conditionally - output input objects as they're being received, because it only needs to consider what input objects have been received so far.

    • This contrasts with Sort-Object, which also offers a -Unique switch, which of necessity must collect all input first before producing output, because all input objects must be considered for proper sorting.

    • As of PowerShell 7.2, Sort-Object -Unique is much faster in practice than Select-Object -Unique.

    • As for how Select-Object -Unique could be implemented in a more efficient, streaming manner: The objects seen so far could be stored in a System.Collections.Generic.HashSet`1 instance to facilitate an efficient test for whether an input object is considered equal to one that has already been output; see this answer for a PowerShell example.

  • If and when Select-Object -Unique is fixed, the tradeoff is as follows:

    • The smaller the proportion of output objects of interest in relation to all input objects, the better off you are using Select-Object -Unique (even if you have to sort the resulting objects afterwards).

    • If you need to output / consider all input objects anyway, and assuming that outputting the objects of interest in sort order is desired / acceptable, Sort-Object is the better choice.
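
The HashSet-based streaming approach mentioned in the notes above could look roughly like the following. This is a minimal sketch, not how Select-Object is or will be implemented: the function name Select-UniqueStreaming is hypothetical, and equality is decided by the objects' own .NET Equals / GetHashCode methods, which may not match the comparison that Select-Object -Unique performs (for example, two distinct [pscustomobject] instances with identical properties would be treated as different here).

```powershell
# Hypothetical helper (not a built-in cmdlet): passes through each input
# object the first time it is seen, without waiting for the rest of the input.
function Select-UniqueStreaming {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        [object] $InputObject
    )
    begin {
        # Holds the objects that have been output so far.
        $seen = [System.Collections.Generic.HashSet[object]]::new()
    }
    process {
        # PowerShell may wrap pipeline objects in [psobject]; unwrap so that
        # the set compares the underlying .NET values.
        $key = if ($null -eq $InputObject) { $null } else { $InputObject.psobject.BaseObject }

        # HashSet.Add() returns $true only if the value was not yet in the set.
        if ($seen.Add($key)) {
            $InputObject
        }
    }
}

# Usage with the array from the question (1,1,2,2,...,10,10):
$array = 1..10 | ForEach-Object { $_; $_ }
$firstFive = $array | Select-UniqueStreaming | Select-Object -First 5
$firstFive -join ','    # 1,2,3,4,5
```

Because the function emits objects from its process block, the downstream Select-Object -First 5 can stop the pipeline as soon as the fifth unique object arrives, so the remaining input is never processed.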


Testing whether a cmdlet produces streaming output or collects all input first:

Short of examining a cmdlet's source code, here's a way to test - the middle pipeline segment is the command to test:

# Test Sort-Object -Unique
# Because the command cannot stream, for conceptual reasons, 
# it takes a while for the one and only output object to appear.
1..1e5 | Sort-Object -Unique | Select-Object -First 1
# Test Select-Object -Unique
# The command *could* stream, conceptually speaking, in which case
# the output object would appear right away.
# However, as of PowerShell 7.2, the command isn't implemented
# in a streaming fashion, so it takes a - surprisingly long - while
# for the output object to appear.
1..1e5 | Select-Object -Unique | Select-Object -First 1

If the given pipeline above produces its one and only output object near instantly, the command of interest is streaming; if it takes a while before the output object appears, it collects all input first.
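
If judging by elapsed time feels too subjective, a variation is to count how many input objects the upstream segment had to emit before the first output object appeared. The following is a sketch under that idea; the variable names are illustrative, and Where-Object { $true } merely stands in for a command that is known to stream:

```powershell
# A non-streaming command in the middle: Sort-Object must see all input
# before it can output anything, so all 1000 objects are emitted upstream.
$count = 0
1..1000 | ForEach-Object { $count++; $_ } | Sort-Object -Unique |
    Select-Object -First 1 | Out-Null
$sortCount = $count      # 1000

# A streaming command in the middle: Select-Object -First 1 stops the
# pipeline as soon as the first object arrives.
$count = 0
1..1000 | ForEach-Object { $count++; $_ } | Where-Object { $true } |
    Select-Object -First 1 | Out-Null
$whereCount = $count     # 1

# The command under discussion; the count reveals whether the installed
# PowerShell version streams (small number) or collects all input (1000).
$count = 0
1..1000 | ForEach-Object { $count++; $_ } | Select-Object -Unique |
    Select-Object -First 1 | Out-Null
$selectCount = $count
```

A count equal to the number of input objects means the command collected everything first; a count close to the number of requested output objects means it streamed.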
