Home > Software design >  Powershell: How to merge two arrays based on property
Powershell: How to merge two arrays based on property

Time:04-26

My first array is list of pet owners and their mobile numbers etc. Second array is pets with the information of owner name. Pet list is roughly 1k and owner list is 3k.

What is the fastest way to add the owner information into the pets information? Currently my script takes almost a minute to run which seems a bit too much.

foreach ($pet in $pets) {
    $owner = $owners | Where-Object { $_.name -eq $pet.owner }
    if ($owner) {
        $pet | Add-Member -MemberType NoteProperty -Name "Owner" -Value $owner.name    }
}

CodePudding user response:

Doing a linear lookup in the owners array for every pet is what makes your approach fundamentally slow: for every pet, 3,000 owner objects must be searched through, resulting in 1,000 x 3,000 = 3 million lookups.

Performance additionally suffers from incidental implementation choices you made:

  • As powerful and elegant as the PowerShell pipeline is, its one by one streaming is usually noticeably slower than iterating over an array in an expression / language statement, such as a foreach loop.

  • Additionally, as of PowerShell 7.2.2, the Where-Object and ForEach-Object cmdlets are inefficiently implemented, which adds additional overhead - see GitHub issue #10982 and this answer.

    • A functional limitation of Where-Object that additionally worsens performance is that there is no way to stop enumerating once a match has been found; that is, the input is invariably processed in full, and all matches are output.

    • By contrast, the analogous .Where() array method does offer a way to stop processing once the first match is found (e.g.,
      (1..10).Where({ $_ -ge 5 }, 'First'). Potentially bringing the same functionality to the Where-Object cmdlet is the subject of GitHub issue #13834.


Therefore, you have two choices:

  • (A) Pragmatic solution: stick with the fundamentally inefficient approach, but improve implementation efficiency so that the resulting performance may be good enough:

    • The (A) solution below is substantially faster than your original, Where-Object-based approach, about 33 to 38 times, depending on the PowerShell edition used; see the next section for benchmarks.
  • (B) Proper, scalable, much better-performing, but more complex solution: Use an aux. data structure that enables efficient lookup of owner objects by name, such as a hashtable, as suggested by Darin.

    • The (B) solution below is about 7 to 13 times faster than the (A) solution, depending on the PowerShell edition used, and therefore ca. 260 to 420 (!) times faster than the Where-Object solution; see the next section for benchmarks.

Note:

  • In the code below, I've modified your example so that a property other than the owner name is added as a new property to each pet object (.Address, as .OwnerAddress), given that the owner name is present to begin with.

  • Also, the -MemberType NoteProperty, -Name and -Value parameter names in the Add-Member call are omitted for brevity (they are implied).

Solution (A): Replace the Where-Object pipeline with an (inner) foreach statement:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

foreach ($pet in $pets) { 
  # Perform the lookup more efficiently via an inner `foreach` loop.
  $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}

Solution (B): Create a hashtable that maps owner names to owner objects, for efficient lookup:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Create a hashtable that maps owner names to owner objects,
# for efficient lookup by name.
$ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }

foreach ($pet in $pets) { 
  # Look up the pet's owner in the owner map (hashtable); returns $null if not found.
  $owner = $ownerMap[$pet.Owner]
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}

Benchmarks

  • Below are sample timings comparing the three approaches, averaged over 10 runs.

  • Timing commands is never an exact science in PowerShell, and performance varies based on many factors, not least the hardware with respect to absolute times, but the results below provide a sense of relative performance, as reflected in the Factor output column: 1.00 denotes the fastest command, listed first, with the slower ones expressed as multiples of it, in descending order of speed.

  • The bottom sections contains the source code, allowing you to run these benchmarks yourself

    • Caveat: With the given collection sizes, these benchmarks run for quite a while (up to 10 minutes and more), mostly due to how slow the Where-Object solution is.

    • For best results, run the benchmarks while your machine isn't (too) busy doing other things.

  • Note how performance seems to have improved substantially overall in the cross-platform PowerShell (Core) edition, relative to Windows PowerShell.

Windows PowerShell 5.1 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.234              # Hashtable-assisted lookups....
6.85   1.605              # Nested foreach statements...
261.95 61.353             # Pipeline with Where-Object...

PowerShell (Core) 7.2.2 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.096              # Hashtable-assisted lookups.…
12.70  1.216              # Nested foreach statements…
424.40 40.624             # Pipeline with Where-Object…

Benchmark source code:

  • The following benchmark code uses function Time-Command from this Gist.

  • Unless already present, you are prompted to automatically download and define this function in your session. (I can personally assure you that doing that is safe, but you should always check the source code yourself.)

# Download and define function `Time-Command` on demand (will prompt).
# To be safe, inspect the source code at the specified URL first.
if (-not (Get-Command -ErrorAction Ignore Time-Command)) {
  $gistUrl = 'https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1'
  if ((Read-Host "`n====`n  OK to download and define benchmark function ``Time-Command```n  from Gist ${gistUrl}?`n=====`n(y/n)?").Trim() -notin 'y', 'yes') { Write-Warning 'Aborted.'; exit 2 }
  Invoke-RestMethod $gistUrl | Invoke-Expression
  if (-not ${function:Time-Command}) { exit 2 }
}

# Define the collection sizes
$petCount = 1000
$ownerCount = 3000

# Define a sample owners array.
$owners = foreach ($i in 1..$ownerCount) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Define a script block that creates a sample pets array.
# Note: We use a script block, because the array must be re-created
#       for each run, since the pet objects get modified.
$petGenerator = {
  foreach ($i in 1..$petCount) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
}

# Define script blocks with the commands to time.
$commands = @(
  { # Nested foreach statements
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
      if ($owner) {
        Add-Member -ea stop -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Pipeline with Where-Object
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $found = $owners | Where-Object { $_.name -eq $pet.Owner }
      if ($found) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Hashtable-assisted lookups.
    $pets = & $petGenerator
    $ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }
    foreach ($pet in $pets) { 
      $owner = $ownerMap[$pet.Owner]
      if ($owner) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  }
)

Write-Verbose -Verbose 'Running benchmarks...'

# Average 10 runs.
# Add -OutputToHost to print script-block output, if desired.
Time-Command -Count 10 $commands
  • Related