How can I improve the speed and memory usage of calculating the size of the N largest files?


I am getting the total number of bytes of the 32 largest files in the folder:

$big32 = Get-ChildItem c:\temp -Recurse |
    Sort-Object Length -Descending |
    Select-Object -First 32 |
    Measure-Object -Property Length -Sum

$big32.Sum / 1gb

However, it's working very slowly. We have about 10 TB of data in 1.4 million files.

CodePudding user response:

The following implements some improvements while using only PowerShell cmdlets. Using System.IO.Directory.EnumerateFiles() as a basis, as suggested by this answer, might give another performance improvement, but you should do your own measurements to compare.

(Get-ChildItem c:\temp -Recurse -File).ForEach('Length') | 
    Sort-Object -Descending -Top 32 | 
    Measure-Object -Sum

This should reduce memory consumption considerably as it only sorts an array of numbers instead of an array of FileInfo objects. It may also be somewhat faster due to better caching (an array of numbers is stored in a contiguous, cache-friendly block of memory, whereas an array of objects stores only the references contiguously, while the objects themselves can be scattered all around in memory).

Note the use of .ForEach('Length') instead of just .Length because of member enumeration ambiguity.
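
For illustration (my example, not from the original answer): with the results stored in a variable, .Length resolves to the array's own Length property (the file count), whereas .ForEach('Length') yields each file's size in bytes:

# Hypothetical variable for illustration:
$files = Get-ChildItem c:\temp -Recurse -File

$files.Length             # the array's own Length property, i.e. the number of files
$files.ForEach('Length')  # one Length value (in bytes) per file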

By using the Sort-Object parameter -Top we can get rid of the Select-Object cmdlet, further reducing pipeline overhead.
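
On Windows PowerShell 5.1, where Sort-Object has no -Top parameter, the same pipeline can fall back to Select-Object (this note is mine, not part of the original answer):

(Get-ChildItem c:\temp -Recurse -File).ForEach('Length') |
    Sort-Object -Descending |
    Select-Object -First 32 |
    Measure-Object -Sum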

CodePudding user response:

I can think of some further improvements, especially to memory usage, but the following should be considerably faster than Get-ChildItem.

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [PSCustomObject]@{
            filename = $_
            length = [System.IO.FileInfo]::New($_).Length
        }
    } | 
    Sort-Object length -Descending | 
    Select-Object -First 32

Edit

I would look at implementing an implicit heap to reduce memory usage without hurting performance (it might even improve it... to be tested).
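
A minimal sketch of that bounded-heap idea, assuming PowerShell 7.2+ (where the .NET [System.Collections.Generic.PriorityQueue] type is available) and that only the lengths are needed; this is my illustration and has not been tested at the 1.4-million-file scale:

# Min-heap keyed on length; once it holds 32 entries, each new enqueue is followed by
# dequeuing the smallest entry, so only the 32 largest lengths seen so far survive.
$heap = [System.Collections.Generic.PriorityQueue[long, long]]::new()

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) |
    ForEach-Object {
        $length = [System.IO.FileInfo]::new($_).Length
        $heap.Enqueue($length, $length)   # element and priority are both the length
        if ($heap.Count -gt 32) {
            $null = $heap.Dequeue()       # discard the smallest of the 33
        }
    }

# Sum the surviving 32 lengths and convert to GB
($heap.UnorderedItems | ForEach-Object { $_.Item1 } | Measure-Object -Sum).Sum / 1gb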

Edit 2

If the filenames are not required, the easiest memory gain is simply not to include them in the results.

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [System.IO.FileInfo]::New($_).Length
    } | 
    Sort-Object -Descending | 
    Select-Object -First 32
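
As a small follow-up of mine (not part of the answer above), the output of that pipeline can be summed to reproduce the original total in gigabytes:

$top32 = [System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [System.IO.FileInfo]::New($_).Length
    } | 
    Sort-Object -Descending | 
    Select-Object -First 32 |
    Measure-Object -Sum

$top32.Sum / 1gb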

CodePudding user response:

Firstly, if you're going to use Get-ChildItem then you should pass the -File switch parameter so that [System.IO.DirectoryInfo] instances never enter the pipeline.

Secondly, you're not passing the -Force switch parameter to Get-ChildItem, so any hidden files in that directory structure won't be retrieved.

Thirdly, note that your code is retrieving the 32 largest files, not the files with the 32 largest lengths. That is, if files 31, 32, and 33 are all the same length, then file 33 will be arbitrarily excluded from the final count. If that distinction is important to you, you could rewrite your code like this...

$filesByLength = Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\' |
    Group-Object -AsHashTable -Property Length
$big32 = $filesByLength.Keys |
    Sort-Object -Descending |
    Select-Object -First 32 |
    ForEach-Object -Process { $filesByLength[$_] } |
    Measure-Object -Property Length -Sum

$filesByLength is a [Hashtable] that maps from a length to the file(s) with that length. The Keys property contains all of the unique lengths of all of the retrieved files, so we get the 32 largest keys/lengths and use each one to send all the files of that length down the pipeline.

Most importantly, sorting the retrieved files to find the largest ones is problematic for several reasons:

  • Sorting cannot start until all of the input data is available, meaning at that point in time all 1.4 million [System.IO.FileInfo] instances will be present in memory.
    • I'm not sure how Sort-Object buffers the incoming pipeline data, but I imagine it would be some kind of list that doubles in size every time it needs more capacity, leading to further garbage in memory to be cleaned up.
  • Each of the 1.4 million [System.IO.FileInfo] instances will be accessed a second time to get its Length property, all while whatever sorting manipulations are occurring (depending on what algorithm Sort-Object uses).

Since we only care about a mere 32 largest files/lengths out of 1.4 million files, what if we only kept track of those 32 instead of all 1.4 million? Consider if we only wanted to find the single largest file...

$largestFileLength = 0
$largestFile = $null

foreach ($file in Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\')
{
    # Track the largest length in a separate variable to avoid two comparisons...
    #     if ($largestFile -eq $null -or $file.Length -gt $largestFile.Length)
    if ($file.Length -gt $largestFileLength)
    {
        $largestFileLength = $file.Length
        $largestFile = $file
    }
}

Write-Host -Object "The largest file is named ""$($largestFile.Name)"" and has length $largestFileLength."

As opposed to Get-ChildItem ... | Sort-Object -Property Length -Descending | Select-Object -First 1, this has the advantage of only one [FileInfo] object being "in-flight" at a time and the complete set of [System.IO.FileInfo]s being enumerated only once. Now all we need to do is to take the same approach but expanded from 1 file/length "slot" to 32...

$basePath = 'C:\Temp\'
$lengthsToKeep = 32
$includeZeroLengthFiles = $false

$listType = 'System.Collections.Generic.List[System.IO.FileInfo]'
# A SortedDictionary[,] could be used instead to avoid having to fully enumerate the Keys
# property to find the new minimum length, but add/remove/retrieve performance is worse
$dictionaryType = "System.Collections.Generic.Dictionary[System.Int64, $listType]"

# Create a dictionary pre-sized to the maximum number of lengths to keep
$filesByLength = New-Object -TypeName $dictionaryType -ArgumentList $lengthsToKeep

# Cache the minimum length currently being kept
$minimumKeptLength = -1L

Get-ChildItem -File -Force -Recurse -Path $basePath |
    ForEach-Object -Process {
        if ($_.Length -gt 0 -or $includeZeroLengthFiles)
        {
            $list = $null
            if ($filesByLength.TryGetValue($_.Length, [ref] $list))
            {
                # The current file's length is already being kept
                # Add the current file to the existing list for this length
                $list.Add($_)
            }
            else
            {
                # The current file's length is not being kept

                if ($filesByLength.Count -lt $lengthsToKeep)
                {
                    # There are still available slots to keep more lengths

                    $list = New-Object -TypeName $listType

                    # The current file's length will occupy an empty slot of kept lengths
                }
                elseif ($_.Length -gt $minimumKeptLength)
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is large enough to keep

                    # Get the list for the minimum length
                    $list = $filesByLength[$minimumKeptLength]

                    # Remove the minimum length to make room for the current length
                    $filesByLength.Remove($minimumKeptLength) |
                        Out-Null

                    # Reuse the list for the now-removed minimum length instead of allocating a new one
                    $list.Clear()

                    # The current file's length will occupy the newly-vacated slot of kept lengths
                }
                else
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is too small to keep
                    return
                }
                $list.Add($_)

                $filesByLength.Add($_.Length, $list)
                $minimumKeptLength = ($filesByLength.Keys | Measure-Object -Minimum).Minimum
            }
        }
    }

# Unwrap the files in each by-length list
foreach ($list in $filesByLength.Values)
{
    foreach ($file in $list)
    {
        $file
    }
}

I went with the approach, described above, of retrieving the files with the 32 largest lengths. A [Dictionary[Int64, List[FileInfo]]] is used to track those 32 largest lengths and the corresponding files with each length. For each input file, we first check if its length is already among the largest so far and, if so, add the file to the existing [List[FileInfo]] for that length. Otherwise, if there's still room in the dictionary we can unconditionally add the input file and its length, or, if the input file is bigger than the smallest tracked length, we can remove that smallest length and add the input file and its length in its place. Once there are no more input files, we output all of the [FileInfo] objects from all of the [List[FileInfo]]s in the [Dictionary[Int64, List[FileInfo]]].
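
As the comment in the code notes, a [SortedDictionary[Int64, List[FileInfo]]] could replace the plain dictionary so that the minimum kept length never has to be recomputed with Measure-Object. A minimal sketch of just the lines that would change (my illustration, not included in the benchmark below):

# SortedDictionary keeps its keys in ascending order (it has no capacity constructor)
$dictionaryType = "System.Collections.Generic.SortedDictionary[System.Int64, $listType]"
$filesByLength = New-Object -TypeName $dictionaryType

# ...same per-file logic as above, except the cached minimum is simply the first key:
$minimumKeptLength = $filesByLength.Keys | Select-Object -First 1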

I ran this simple benchmarking template...

1..5 |
    ForEach-Object -Process {
        [GC]::Collect()

        return Measure-Command -Expression {
            # Code to test
        }
    } | Measure-Object -Property 'TotalSeconds' -Minimum -Maximum -Average

...on PowerShell 7.2 against my $Env:WinDir directory (325,000 files) with these results:

# Code to test | Minimum | Maximum | Average | Memory usage*
Get-ChildItem -File -Force -Recurse -Path $Env:WinDir | 69.7240896 | 79.727841 | 72.81731518 | 260 MB
Get $Env:WinDir files with 32 largest lengths using -AsHashtable, Sort-Object | 82.7488729 | 83.5245153 | 83.04068032 | 1 GB
Get $Env:WinDir files with 32 largest lengths using dictionary of by-length lists | 81.6003697 | 82.7035483 | 82.15654538 | 235 MB

* As observed in the Task Manager → Details tab → Memory (active private working set) column

I'm a little disappointed that my solution is only about 1% faster than the code using the Keys of a [Hashtable], but perhaps grouping the files with a compiled cmdlet versus neither grouping nor sorting them but running more (interpreted) PowerShell code is a wash. Still, the difference in memory usage is significant, though I can't explain why the bare Get-ChildItem call that simply enumerates all files ended up using a bit more memory than the dictionary approach.
