I'm processing large amounts of data; after pulling and manipulating it, I have the results stored in memory in a variable.
I now need to split this data into two separate variables. That was easily done by piping into a Where-Object, but it has slowed down badly now that I have much more data (1 million plus members). Note: it takes about 5 minutes.
$DCEntries = $DNSQueries | ? {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'}
$NonDCEntries = $DNSQueries | ? {$_.ClientIP -notin $DCs.ipv4address -And $_.ClientIP -ne '127.0.0.1'}
#Note:
#$DCs is an array of 60 objects of type Microsoft.ActiveDirectory.Management.ADDomainController, with two properties: Name, ipv4address
#$DNSQueries is a collection of pscustomobjects, each with 6 properties, all strings.
I immediately realize I'm enumerating $DNSQueries (the large collection) twice, which is obviously costing me time. So I decided to go about this a different way, enumerating it once and using a Switch statement, but this seems to have caused the timing to INCREASE dramatically, which is not what I was going for.
$DNSQueries | ForEach-Object {
    Switch ($_) {
        {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'} {
            # Query is from a DC
            $DCEntries += $_
        }
        default {
            # Query is not from a DC
            $NonDCEntries += $_
        }
    }
}
I'm wondering if someone can explain why the second approach takes so much more time, and perhaps offer a better way to accomplish what I want.
Is the ForEach-Object and/or the appending to the sub-variables costing that much time?
CodePudding user response:
ForEach-Object is actually the slowest way to enumerate a collection, and on top of that your switch uses a script block condition, which adds even more overhead.
If the collection is already in memory, nothing can beat a foreach loop for linear enumeration.
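For a rough sense of the gap (a minimal sketch with made-up sample data; exact timings vary by machine and PowerShell version):
# Sample input; any in-memory collection will do
$data = 1..100000
# Pipeline enumeration: a script block is invoked once per item
(Measure-Command { $data | ForEach-Object { $_ } }).TotalMilliseconds
# Language-level foreach loop: no pipeline or per-item script block overhead
(Measure-Command { foreach ($i in $data) { $i } }).TotalMilliseconds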
As for your biggest problem: the use of += to add elements to an array, which is a fixed-size collection. PowerShell has to create a new array and copy all existing elements into it each time an element is added, and this causes an extremely high amount of overhead. See this answer as well as this awesome documentation for more details.
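You can see the fixed-size behavior directly (a quick illustrative check, not part of the solution):
$arr = 1, 2, 3
$arr.IsFixedSize   # True: a System.Array can never grow in place
$arr += 4          # allocates a brand-new 4-element array and copies the old 3 over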
In this case you can combine a Collections.Generic.List&lt;T&gt; with PowerShell's explicit assignment.
$NonDCEntries = [Collections.Generic.List[object]]::new()
$DCEntries = foreach($item in $DNSQueries) {
    if($item.ClientIP -in $DCs.IPv4Address -or $item.ClientIP -eq '127.0.0.1') {
        # Emit to the pipeline; PowerShell collects all loop output into $DCEntries
        $item
        continue
    }
    $NonDCEntries.Add($item)
}
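As an aside, the intrinsic .Where() method also supports a 'Split' mode that partitions a collection into matching and non-matching halves in one pass. It still evaluates a script block per item, so expect it to be slower than the foreach loop above, but it is a concise alternative (a sketch, assuming the same variables):
# .Where(<predicate>, 'Split') returns two collections: matches, then non-matches
$DCEntries, $NonDCEntries = $DNSQueries.Where({
    $_.ClientIP -in $DCs.IPv4Address -or $_.ClientIP -eq '127.0.0.1'
}, 'Split')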
To put into perspective how exponentially bad += to an array is, you can test this code:
$Tests = [ordered]@{
    'PowerShell Explicit Assignment' = {
        $result = foreach($i in 1..$count) {
            $i
        }
    }
    '+= Operator to System.Array' = {
        $result = @( )
        foreach($i in 1..$count) {
            $result += $i
        }
    }
    '.Add(..) to List<T>' = {
        $result = [Collections.Generic.List[int]]::new()
        foreach($i in 1..$count) {
            $result.Add($i)
        }
    }
}
foreach($count in 1000, 10000, 100000) {
    foreach($test in $Tests.GetEnumerator()) {
        $measurement = (Measure-Command { & $test.Value }).TotalMilliseconds
        $totalRound  = [math]::Round($measurement, 2).ToString() + ' ms'
        [pscustomobject]@{
            CollectionSize    = $count
            Test              = $test.Key
            TotalMilliseconds = $totalRound
        }
    }
}
Which on my laptop yields the following results:
CollectionSize Test                           TotalMilliseconds
-------------- ----                           -----------------
          1000 PowerShell Explicit Assignment 15.9 ms
          1000 += Operator to System.Array    26.88 ms
          1000 .Add(..) to List<T>            12.47 ms
         10000 PowerShell Explicit Assignment 1.07 ms
         10000 += Operator to System.Array    2488.24 ms
         10000 .Add(..) to List<T>            0.9 ms
        100000 PowerShell Explicit Assignment 16.07 ms
        100000 += Operator to System.Array    308931.8 ms
        100000 .Add(..) to List<T>            8.39 ms