I have a large data array (Invoke-Sqlcmd export, array variable), lets say it contains all of our 'customers'. I have large data list (Local search results, string variable), lets say it contains all of our 'paying customers' I want to create a final array by subtracting the people in the list from the array, leaving me with 'customers who have not paid yet'. The code is doing exactly this but runs very slow, Im hoping to educate myeself on a quicker way.
The size of the first array is around 25,000 and the size of the list is around 20,000. Because the list is just a plain text variable, I first split it on new lines to allow the ForEach to take place (otherwise it registers as one object) The code im using to do this is:
$NewArray = $FirstArray
ForEach ($Customer In $List)
{
$NewArray = $NewArray | ? {$_.CustomerID -ne $Customer}
}
any help greatly appreciated, Thanks
CodePudding user response:
I've done a bit of performance testing, and by far and away the fastest approach is @zett42's suggestion of converting the customer list into a hashtable.
Test Data
First, here's a little snippet to set up some test data. It has some unrealistic properties like being sorted by id, with the first 80% being paid customers and the last 20% unpaid, but I don't think that will affect the results much.
# build some test data
$count = 25000
$allCustomers = @(
1..$count | foreach-object {
[pscustomobject] [ordered] @{
"CustomerID" = $_
}
}
);
$paidIds = [int[]] @( 1..($count * 0.8) )
Original Approach
Here's the OP's approach as a baseline:
Measure-Command {
$NewArray = @( $allCustomers );
foreach( $paidId In $paidIds )
{
$NewArray = $NewArray | where-object { $_.CustomerID -ne $paidId }
}
}
# TotalMilliseconds : 37732.532
Hashtable
And here's @zett42's suggestion of using a hashtable to do fast removal of customers by CustomerId when we find them in the paid list:
Measure-Command {
# convert customers into a hashtable for fast removal by id
$lookups = @{};
foreach( $customer in $allCustomers )
{
$lookups.Add($customer.CustomerId, $customer)
}
# remove all the paid customer ids
foreach( $paidId In $paidIds )
{
$lookups.Remove($paidId)
}
# get the remaining unpaid customers
$NewArray = @( $lookups.Values )
}
# TotalMilliseconds : 53.9363
-notin
@AbrahamZinala's suggestion of -notin
also works pretty well if you want a quick one-liner and don't mind not having the raw blazing speed of the hashtable approach.
Measure-Command {
$NewArray = $allCustomers | where-object { $_.CustomerId -notin $paidIds };
}
# TotalMilliseconds : 698.2812
The hashtable approach scales better with larger datasets, so if you don't mind the little bit of extra setup that might be the one to go for...