I have two Excel files with over 100,000 rows each. I want to merge them in PowerShell. I can import them quickly with Import-Excel; the issue is that I can't find a way to merge them that doesn't take forever.
I want an inner join. In R or Python, an inner join takes probably less than a second and is one line of code. Is there any way to do this quickly in PowerShell?
For example, I tried the Join-Object function, but it took forever:
Join-Object -Left $df1 -Right $df2 -LeftJoinProperty name -RightJoinProperty name -Type OnlyIfInBoth
Why is this so fast in other languages, even with over 100,000 rows, but impossibly slow in PowerShell? I want to use PowerShell because I don't have Python or R on some computers.
CodePudding user response:
Perhaps using this will be faster:
# create a Hashtable, which works case-insensitive by default
$right = @{}
foreach ($name in $df2.Name) { $right[$name] = $true } # the value doesn't matter
# now loop over the left array ($df1) and output only
# the rows where the name is found in the $right Hashtable
$result = foreach ($item in $df1) {
    if ($right.ContainsKey($item.Name)) {
        $item
    }
}
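The code above only filters the left array; it doesn't copy any columns from the right file into the result. If you need a real inner join that combines columns from both sides, you can store the whole right-side row in the Hashtable instead of $true. A minimal sketch, assuming each Name appears at most once in $df2 and that both files have Name and Id columns (the output property names Id1 and Id2 are just illustrative):

```powershell
# Store the whole right-side row, not just $true, so its columns
# can be copied into the joined output
$right = @{}
foreach ($row in $df2) { $right[$row.Name] = $row }

$result = foreach ($item in $df1) {
    $match = $right[$item.Name]   # $null when there is no matching row
    if ($null -ne $match) {
        # Pick whichever columns you need from either side;
        # these property names are only an example
        [PSCustomObject]@{
            Name = $item.Name
            Id1  = $item.Id
            Id2  = $match.Id
        }
    }
}
```

If Name can repeat in the right file, store a list of rows per key and loop over it instead, otherwise only the last duplicate survives.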
If you want to compare the names case-sensitively, you can use a HashSet:
$right = [System.Collections.Generic.HashSet[String]]::new()
foreach ($name in $df2.Name) { [void]$right.Add($name) }
# now loop over the left array ($df1) and output only
# the rows where the name is found in the $right HashSet
$result = foreach ($item in $df1) {
    if ($right.Contains($item.Name)) {
        $item
    }
}
You can also check whether Compare-Object is faster in your case:
$result = Compare-Object -ReferenceObject $df1 -DifferenceObject $df2 -Property Name -IncludeEqual -ExcludeDifferent -PassThru
Compare-Object also has a -CaseSensitive switch if that is what you need.
CodePudding user response:
The .NET method Linq.Enumerable.Join seems to be pretty fast. As it's not immediately obvious how to use it from PowerShell (you need some casts to be able to pass ScriptBlocks to its selector parameters), I've wrapped it in a PowerShell function Join-Enumerable to make it easier to use.
Function Join-Enumerable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)] $Outer,
        [Parameter(Mandatory)] $Inner,
        [Parameter(Mandatory)] [scriptblock] $OuterSelector,
        [Parameter(Mandatory)] [scriptblock] $InnerSelector,
        [Parameter(Mandatory)] [scriptblock] $ResultSelector
    )
    # Note: Linq.Enumerable.Join expects the outer key selector
    # before the inner key selector
    [Linq.Enumerable]::Join( $Outer, $Inner,
        [Func[object, string]] $OuterSelector,
        [Func[object, string]] $InnerSelector,
        [Func[object, object, object]] $ResultSelector )
}
Demo:
$df1 = ConvertFrom-Csv @'
Name,Id
foo,1
bar,2
baz,3
'@
$df2 = ConvertFrom-Csv @'
Name,Id
bar,4
bam,5
foo,6
'@
Join-Enumerable -Outer $df1 -Inner $df2 -OuterSelector { $args.Name } -InnerSelector { $args.Name } -ResultSelector {
    [PSCustomObject] @{
        Name = $args[0].Name
        Id1  = $args[0].Id
        Id2  = $args[1].Id
    }
}
Output:
Name Id1 Id2
---- --- ---
foo  1   6
bar  2   4
Remarks:
- Parameters -Outer and -Inner specify the two arrays (more exactly, anything that implements the IEnumerable interface) to join
- Parameter -OuterSelector specifies how to get the key to join on from an element of the outer array
- Parameter -InnerSelector specifies how to get the key to join on from an element of the inner array
- Parameter -ResultSelector specifies how to generate each output object
- For maximum performance, outer, inner and result should use class objects instead of PSCustomObjects.
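To illustrate the last remark, here is a minimal sketch of what "class objects" means: a PowerShell class whose typed properties resolve faster than the note properties of a PSCustomObject. The class name Row and the file names are made up for the example; adapt them to your own columns:

```powershell
# Typed properties on a PowerShell class are cheaper to access
# than the note properties of a PSCustomObject
class Row {
    [string] $Name
    [int]    $Id
    Row([string] $name, [int] $id) {
        $this.Name = $name
        $this.Id   = $id
    }
}

# Convert the imported rows to class instances once, up front,
# then join the class-based arrays instead of the raw import
$df1 = foreach ($r in (Import-Csv 'left.csv'))  { [Row]::new($r.Name, $r.Id) }
$df2 = foreach ($r in (Import-Csv 'right.csv')) { [Row]::new($r.Name, $r.Id) }
```

The same idea applies to the -ResultSelector: constructing a result class instance instead of a PSCustomObject avoids per-row note-property overhead.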
For testing the performance with 100,000 elements, I used this code. It creates two arrays of 100,000 objects each, with random Name properties. The Name properties of the 2nd array match elements of the 1st array randomly, so this is a worst-case 1:1 join, which should give naive algorithms a hard time.
On my machine, it took about 1.7 s when using class objects and about 4.5 s when using PSCustomObjects. That doesn't seem too bad. When I find some time, I will do a comparison with other algorithms.