I've written a script to help me identify duplicate files in my large movie collection. For some reason, if I split these commands and export/import to CSV, it runs much faster than if I leave everything in memory. Here is my original code, which is god-awful slow:
Get-ChildItem M:\Movies\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicatemovies.csv
If I split this into 2 commands and export to CSV in the middle it runs about 100x faster. I'm hoping someone could shed some light on what I'm doing wrong.
Get-ChildItem M:\Movies\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Export-Csv -notypeinformation M:\Misc\Scripts\DuplicateMovies\4.csv
import-csv M:\Misc\Scripts\DuplicateMovies\4.csv | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\DuplicateMovies\Duplicatemovies.csv
remove-item M:\Misc\Scripts\DuplicateMovies\4.csv
appreciate any suggestions,
~TJ
CodePudding user response:
It's not Group-Object that is slow, it's your grouping condition. You're asking it to group FileInfo objects by their .Directory property, which holds the parent folder's DirectoryInfo instance, so the cmdlet has to group by a very complex object. Instead, you could group by the .DirectoryName property, which is the parent directory's FullName (a simple string), or by .Directory.Name, which is the parent folder's Name (also a simple string).
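To see the difference between these properties for yourself, here is a minimal sketch that inspects their types on a single file. The temporary folder and the file name sample.mkv are made up for illustration so the snippet does not depend on M:\Movies:

```powershell
# Create a throwaway folder with one file (hypothetical names, for illustration only).
$tempDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
$null = New-Item -ItemType Directory -Path $tempDir
$null = New-Item -ItemType File -Path (Join-Path $tempDir 'sample.mkv')

$file = Get-ChildItem -LiteralPath $tempDir -File | Select-Object -First 1

# .Directory is a full DirectoryInfo object; the other two are plain strings.
$directoryType     = $file.Directory.GetType().FullName       # System.IO.DirectoryInfo
$directoryNameType = $file.DirectoryName.GetType().FullName   # System.String
$parentNameType    = $file.Directory.Name.GetType().FullName  # System.String

# Clean up the throwaway folder.
Remove-Item -LiteralPath $tempDir -Recurse -Force
```

Grouping by either of the string properties gives Group-Object a cheap value to compare instead of a whole object.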
To summarize, the main reason exporting to a CSV is faster in this case is that when Export-Csv receives your objects from the pipeline, it calls ToString() on each property value. The Directory instance is therefore converted to its string representation (calling ToString() on this instance yields the folder's FullName), so the Group-Object that follows Import-Csv only has to compare plain strings.
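The same conversion can be observed without writing a file. This is a rough sketch that uses ConvertTo-Csv and ConvertFrom-Csv as in-memory stand-ins for Export-Csv and Import-Csv; the temporary folder and file name are assumptions for illustration:

```powershell
# Create a throwaway folder with one file (hypothetical names, for illustration only).
$tempDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
$null = New-Item -ItemType Directory -Path $tempDir
$null = New-Item -ItemType File -Path (Join-Path $tempDir 'sample.mkv')

$file     = Get-ChildItem -LiteralPath $tempDir -File | Select-Object -First 1
$selected = $file | Select-Object Directory, Name

# Round-trip through CSV text, as the two-step version of the script does via a file.
$roundTripped = $selected | ConvertTo-Csv -NoTypeInformation | ConvertFrom-Csv

$typeBefore = $selected.Directory.GetType().FullName      # System.IO.DirectoryInfo
$typeAfter  = $roundTripped.Directory.GetType().FullName  # System.String

# Clean up the throwaway folder.
Remove-Item -LiteralPath $tempDir -Recurse -Force
```

After the round trip, the Directory column is just text, which is what makes the second half of the split pipeline cheap.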
As for your code, if you want to keep it as efficient as possible without overcomplicating it:
Get-ChildItem M:\Movies\ -Recurse -File | & {
    process {
        # Keep only files larger than 500 MB (500mb = 524288000 bytes)
        if($_.Length -gt 500mb) { $_ }
    }
} | Group-Object DirectoryName | & {
    process {
        # A folder holding more than one large file is a duplicate candidate
        if($_.Count -gt 1) {
            foreach($object in $_.Group) {
                [pscustomobject]@{
                    Directory = $_.Name # => This is the Parent Directory FullName
                    Name      = $object.Name
                }
            }
        }
    }
} | Export-Csv M:\Misc\Scripts\DuplicateMovies\4.csv -NoTypeInformation
If you want to group them by the parent's Name instead of its FullName, you could use:
Group-Object { $_.Directory.Name }
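One thing to keep in mind when choosing between the two keys: grouping by the parent's Name merges folders that share a leaf name but live under different parents. A small sketch, assuming a made-up layout of A\Extras and B\Extras in a temporary folder:

```powershell
# Build two folders with the same leaf name under different parents (hypothetical layout).
$root = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
foreach ($parent in 'A', 'B') {
    $dir = Join-Path (Join-Path $root $parent) 'Extras'
    $null = New-Item -ItemType Directory -Path $dir -Force
    $null = New-Item -ItemType File -Path (Join-Path $dir "$parent.mkv")
}

$files = Get-ChildItem -LiteralPath $root -Recurse -File

# Group by the full parent path versus by the parent's leaf name only.
$byFullName = @($files | Group-Object DirectoryName)
$byLeafName = @($files | Group-Object { $_.Directory.Name })

$fullNameGroupCount = $byFullName.Count  # 2: ...\A\Extras and ...\B\Extras stay separate
$leafNameGroupCount = $byLeafName.Count  # 1: both folders collapse into 'Extras'

# Clean up the throwaway folders.
Remove-Item -LiteralPath $root -Recurse -Force
```

For finding duplicates within a single folder, DirectoryName is probably the safer key.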