Home > other >  PowerShell inner-join large Excel table is extremely slow
PowerShell inner-join large Excel table is extremely slow

Time:09-04

I have two excel-files with over 100,000 rows. I want to merge them in PowerShell. I can quickly import them with Import-Excel. The issue is that I can't find a way to merge them that doesn't take forever.

I want an inner-join. In R or Python an inner-join takes probably less than a second and is one line of code. Is there anyway to do this in PowerShell that is quick?

For example, I tried this join-object function but it took forever: join-object

Join-Object -Left $df1 -Right $df2 -LeftJoinProperty name -RightJoinProperty name -Type OnlyIfInBoth 

Why is this so fast in other languages even with over 100,000 rows but in PowerShell it's impossibly slow? I want to use PowerShell because I don't have Python or R on some computers.

CodePudding user response:

Perhaps using this will be faster:

# create a Hashtable, which works case-insensitive by default
$right = @{}
foreach ($name in $df2.Name) { $right[$name] = $true }  # the value doesn't matter

# now loop over the left array ($df1) and output only 
# the rows where the name is found in the $right Hashtable
$result = foreach ($item in $df1) {
    if ($right.ContainsKey($item.Name)) {
        $item
    }
}

If you want to compare the names case-sensitively, you can use a HashSet:

$right = [System.Collections.Generic.HashSet[String]]::new()
foreach ($name in $df2.Name) { [void]$right.Add($name) } 

# now loop over the left array ($df1) and output only 
# the rows where the name is found in the $right HashSet
$result = foreach ($item in $df1) {
    if ($right.Contains($item.Name)) {
        $item
    }
}

You can also try if Compare-Object would be faster in your case:

$result = Compare-Object -ReferenceObject $df1 -DifferenceObject $df2 -Property Name -IncludeEqual -ExcludeDifferent -PassThru

Compare-Object also has a -CaseSensitive switch if that is what you need

CodePudding user response:

The .NET method Linq.Enumerable.Join seems to be pretty fast. As it's not immediately obvious how to use it from PowerShell (you need some casts to be able to pass ScriptBlocks to its selector parameters), I've wrapped it in a PowerShell function Join-Enumerable to make it easier to use.

Function Join-Enumerable{
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)] $Outer,
        [Parameter(Mandatory)] $Inner,
        [Parameter(Mandatory)] [scriptblock] $InnerSelector,
        [Parameter(Mandatory)] [scriptblock] $OuterSelector,
        [Parameter(Mandatory)] [scriptblock] $ResultSelector
    )

    [Linq.Enumerable]::Join( $Outer, $Inner, 
        [Func[object, string]] $InnerSelector, 
        [Func[object, string]] $OuterSelector, 
        [Func[object, object, object]] $ResultSelector )
}

Demo:

$df1 = ConvertFrom-Csv @'
Name,Id
foo,1
bar,2
baz,3
'@
$df2 = ConvertFrom-Csv @'
Name,Id
bar,4
bam,5
foo,6
'@

Join-Enumerable -Outer $df1 -Inner $df2 -OuterSelector { $args.Name } -InnerSelector { $args.Name } -ResultSelector {
    [PSCustomObject] @{ 
        Name = $args[0].Name
        Id1  = $args[0].Id
        Id2  = $args[1].Id
    }
}

Output:

Name Id1 Id2
---- --- ---
foo  1   6
bar  2   4

Remarks:

  • Parameters -Outer and -Inner specify the two arrays (more exactly anything that implements the IEnumerable interface) to join
  • Parameter -OuterSelector specifies how to get the key to join on from an element of outer array
  • Parameter -InnerSelector specifies how to get the key to join on from an element of inner array
  • Parameter -ResultSelector specifies how to generate each output object
  • For maximum performance, outer, inner and result should use class objects instead of PSCustomObject.

For testing performance of 100000 elements, I used the this code. It creates two arrays of objects with 100000 elements each, with random name properties. The name properties of the 2nd array match elements from the 1st array randomly. So this is worst case 1:1 joining, which should give naive algorithms a hard time.

On my machine, it took about 1.7 s when using class objects and about 4.5 s when using PSCustomObjects. Doesn't seem to bad. When I find some time, I will do comparison with other algorithms.

  • Related