PowerShell - Group-Object Poor Performance

Time:10-19

I've written a script to help me identify duplicate files in my large movie collection. For some reason, if I split these commands and export/import to CSV, it runs much faster than if I leave everything in memory. Here is my original code, which is god-awful slow:

Get-ChildItem M:\Movies\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicatemovies.csv

If I split this into 2 commands and export to CSV in the middle it runs about 100x faster. I'm hoping someone could shed some light on what I'm doing wrong.

Get-ChildItem M:\Movies\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Export-Csv -notypeinformation M:\Misc\Scripts\DuplicateMovies\4.csv

import-csv M:\Misc\Scripts\DuplicateMovies\4.csv | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\DuplicateMovies\Duplicatemovies.csv

remove-item M:\Misc\Scripts\DuplicateMovies\4.csv

appreciate any suggestions,

~TJ

CodePudding user response:

It's not Group-Object that is slow, it's your grouping condition. You're asking it to group FileInfo objects by their .Directory property, which holds the parent folder's DirectoryInfo instance. In other words, you're asking the cmdlet to use a very complex object as the grouping key. Instead, you could group by the .DirectoryName property, which is the parent directory's FullName (a simple string), or by .Directory.Name, which is the parent folder's Name (also a simple string).
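To make the difference between the two properties concrete, here is a minimal sketch. It assumes a throwaway folder under the temp path (the `groupdemo_` prefix and the `A`/`B` subfolders are made up for illustration, they are not from the question), so it can be run without touching M:\Movies:

```powershell
# Hypothetical scratch folder so the sketch does not depend on M:\Movies
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("groupdemo_" + [guid]::NewGuid().ToString('N'))
foreach ($sub in 'A', 'B') {
    $dir = New-Item -ItemType Directory -Path (Join-Path $root $sub) -Force
    foreach ($i in 1..2) {
        Set-Content -Path (Join-Path $dir.FullName "file$i.txt") -Value 'x'
    }
}

$files = Get-ChildItem $root -Recurse -File
$first = $files | Select-Object -First 1

# .Directory is a full DirectoryInfo object; .DirectoryName is a plain string
$first.Directory.GetType().FullName       # System.IO.DirectoryInfo
$first.DirectoryName.GetType().FullName   # System.String

# Grouping by the string key: one group per parent folder
$byString = $files | Group-Object DirectoryName
$byString | Select-Object Count, Name

# Clean up the scratch folder
Remove-Item $root -Recurse -Force
```

On a real collection you could wrap each variant (`Group-Object Directory` versus `Group-Object DirectoryName`) in `Measure-Command { ... }` to compare timings yourself; the numbers will depend on how many files are involved.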

The main reason exporting to a CSV is faster in this case is that when Export-Csv receives your objects from the pipeline, it calls ToString() on each object's property values, so the Directory instance gets converted to its string representation (calling ToString() on this instance yields the folder's FullName). Import-Csv then hands Group-Object plain strings to group on.
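The CSV round trip can be reproduced in memory. The sketch below uses ConvertTo-Csv and ConvertFrom-Csv, the in-memory counterparts of Export-Csv and Import-Csv, on a single temporary file (created with New-TemporaryFile, which assumes PowerShell 5.0 or later) and only checks the property types before and after:

```powershell
# Temporary file to stand in for one movie file (requires PowerShell 5.0+)
$file = New-TemporaryFile

# Same projection as in the question: Directory is still a DirectoryInfo here
$row = Get-Item $file.FullName | Select-Object Directory, Name
$row.Directory.GetType().FullName            # System.IO.DirectoryInfo

# Serializing to CSV text and reading it back leaves only strings
$roundTripped = $row | ConvertTo-Csv -NoTypeInformation | ConvertFrom-Csv
$roundTripped.Directory.GetType().FullName   # System.String

# Clean up
Remove-Item $file.FullName
```

After the round trip, Group-Object directory is effectively grouping on a string, much like grouping on DirectoryName directly.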

As for your code, if you want to keep it as efficient as possible without overcomplicating it:

Get-ChildItem M:\Movies\ -Recurse -File | & {
    process {
        if($_.Length -gt 500mb) { $_ }
    }
} | Group-Object DirectoryName | & {
    process {
        if($_.Count -gt 1) {
            foreach($object in $_.Group) {
                [pscustomobject]@{
                    Directory = $_.Name # => This is the Parent Directory FullName
                    Name      = $object.Name
                }
            }
        }
    }
} | Export-Csv M:\Misc\Scripts\DuplicateMovies\4.csv -NoTypeInformation

If you want to group them by the parent folder's Name instead of its FullName, you could use:

Group-Object { $_.Directory.Name }