I'm just curious if I'm missing any documentation, or if there is a different/better way to do this that negates the need for documentation. Maybe I'm the only one trying to use Select-Object to select the -First X unique instances from a set of data.
Based on the testing below, it looks like using Select-Object with the -Unique switch and some type of limiter (-First, -Last, -Skip, -Index, etc.) inherently causes the limiter to be applied BEFORE duplicates are removed. This doesn't make sense to me conceptually, and it also doesn't appear to be documented.
I apologize for the poor example, but consider an array of 20 items with each item appearing twice:
PS > $array = @() ; 1..10 | % { $array += $_ ; $array += $_ }
PS > $array -Join ','
1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10 ##Displaying the array on a single comma separated line
Let's say that someone gives you $array, but you can only handle a maximum input of 5 objects. Filtering down what you're given, you might be tempted to use Select-Object. At first you end up with 5 objects, but there are duplicates, so, thinking quickly, you simply add the -Unique switch, only to realize that the output still isn't quite right.
PS > ($array | Select-Object -First 5) -Join ','
1,1,2,2,3 ##5 objects as expected, but with duplicates
PS > ($array | Select-Object -Unique -First 5) -Join ','
1,2,3 ##No duplicates, but less than the expected 5 objects...
To get the outcome I was expecting, I'd need Select-Object to remove the duplicates prior to returning the final set of objects. While there is nothing wrong with knowing this, it seems strange to me that Select-Object uses the order of operations that it does, and also that there isn't any documentation of the fact that the -Unique switch is applied at the end of the cmdlet.
PS > ($array | Select-Object -Unique | Select-Object -First 5) -Join ','
1,2,3,4,5 ##This is my expected outcome, 5 objects returned without any duplicates
CodePudding user response:
Indeed, the -First / -Last / -Skip / -Index / -SkipIndex / -SkipLast parameters apply to the original input first, and -Unique is applied to the resulting output.
The simple workaround is to use two Select-Object calls: one that finds the unique objects, and another that selects the desired number from among the unique ones:
PS> 1, 1, 2, 3 | Select-Object -Unique | Select-Object -First 2
1
2
Note:
While the extra Select-Object call does add processing overhead, the command overall has the potential to process only as many input objects as needed, i.e. to stop processing once the desired number of unique objects has been found.
However, as of PowerShell 7.2, it seems that Select-Object -Unique is implemented inefficiently and unexpectedly collects all input first before producing output, even though there's no conceptual reason to do so: it should be able to produce streaming output, i.e. to conditionally output input objects as they're being received, because it only needs to consider what input objects have been received so far.
This contrasts with Sort-Object, which also offers a -Unique switch, but which of necessity must collect all input first before producing output, because all input objects must be considered for proper sorting.
As of PowerShell 7.2, Sort-Object -Unique is much faster in practice than Select-Object -Unique.
As for how Select-Object -Unique could be implemented in a more efficient, streaming manner: the objects seen so far could be stored in a System.Collections.Generic.HashSet`1 instance to facilitate an efficient test for whether an input object is considered equal to one that has already been output; see this answer for a PowerShell example.
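To make the HashSet idea concrete, here is a minimal sketch of a streaming de-duplicating filter. The function name Select-UniqueStreaming is hypothetical (it is not a built-in cmdlet), and the sketch assumes PowerShell 5 or later and input objects whose default .NET equality is meaningful (numbers, strings compared case-sensitively, and so on):

```powershell
# Hypothetical helper, not part of PowerShell: passes each distinct input
# object through as soon as it is first seen, instead of collecting all input.
function Select-UniqueStreaming {
    param(
        [Parameter(ValueFromPipeline)]
        $InputObject
    )
    begin {
        # Tracks the objects that have already been output.
        $seen = [System.Collections.Generic.HashSet[object]]::new()
    }
    process {
        # HashSet.Add() returns $true only if the item was not yet in the set.
        if ($seen.Add($InputObject)) { $InputObject }
    }
}

# Usage: the downstream Select-Object -First 2 can stop the pipeline
# once two unique objects have arrived.
1, 1, 2, 2, 3, 3 | Select-UniqueStreaming | Select-Object -First 2
```

Because the filter emits objects from its process block, it never needs to see the whole input before producing output, which is the property the built-in -Unique switch currently lacks.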
If and when Select-Object -Unique is fixed, the tradeoff is as follows:
The smaller the proportion of output objects of interest relative to all input objects, the better off you are using Select-Object -Unique (even if you have to sort the resulting objects afterwards).
If you need to output / consider all input objects anyway, and assuming that outputting the objects of interest in sort order is desired / acceptable, Sort-Object is the better choice.
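As a small illustration of the difference in output order between the two cmdlets, consider the following sketch; the sample values in $data are an assumption chosen for the example:

```powershell
# Sample data (assumed for illustration): unsorted input with duplicates.
$data = 3, 1, 3, 2, 1

# Select-Object -Unique keeps the order of first appearance.
$data | Select-Object -Unique                          # 3, 1, 2

# Sort-Object -Unique returns the distinct values in sort order.
$data | Sort-Object -Unique                            # 1, 2, 3

# Either one can feed a second Select-Object call that applies the limit.
$data | Sort-Object -Unique | Select-Object -First 2   # 1, 2
```

If the original input order matters, the Sort-Object variant is not a drop-in replacement, even where it is faster.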
Testing whether a cmdlet produces streaming output or collects all input first:
Short of examining a cmdlet's source code, here's a way to test - the middle pipeline segment is the command to test:
# Test Sort-Object -Unique
# Because the command cannot stream, for conceptual reasons,
# it takes a while for the one and only output object to appear.
1..1e5 | Sort-Object -Unique | Select-Object -First 1
# Test Select-Object -Unique
# The command *could* stream, conceptually speaking, in which case
# the output object would appear right away.
# However, as of PowerShell 7.2, the command isn't implemented
# in a streaming fashion, so it takes a - surprisingly long - while
# for the output object to appear.
1..1e5 | Select-Object -Unique | Select-Object -First 1
If the given pipeline above produces its one and only output object near instantly, the command of interest is streaming; if it takes a while before the output object appears, it collects all input first.
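If judging "near instantly" by eye feels too subjective, the same test can be timed with Measure-Command. The following is only a sketch: the input size in $count and the ForEach-Object baseline (a command that does stream) are assumptions added for comparison, and absolute timings will vary by machine and PowerShell version.

```powershell
$count = 10000   # assumed input size; kept small so the slow case stays tolerable

# Each script block is a complete pipeline whose middle segment is the command under test.
$pipelines = [ordered]@{
    'ForEach-Object (baseline)' = { 1..$count | ForEach-Object { $_ } | Select-Object -First 1 }
    'Sort-Object -Unique'       = { 1..$count | Sort-Object -Unique   | Select-Object -First 1 }
    'Select-Object -Unique'     = { 1..$count | Select-Object -Unique | Select-Object -First 1 }
}

$results = foreach ($name in $pipelines.Keys) {
    $pipeline = $pipelines[$name]
    # Time how long it takes until the single output object has been produced.
    $elapsed = Measure-Command { & $pipeline }
    [pscustomobject]@{
        Pipeline     = $name
        Milliseconds = [math]::Round($elapsed.TotalMilliseconds, 1)
    }
}
$results
```

A streaming command should take about as long as the baseline regardless of $count, whereas a collecting command should take noticeably longer as $count grows.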