optimize full path extraction of 1 million files +/- and iteration from file

Time:08-24

I am a programming enthusiast and novice. I am using PowerShell to try to solve the following need:

  1. I need to extract the full paths of files with the extension .img inside a folder with +/- 900 thousand folders and +/- 1 million files, of which +/- 900,000 are img files.
  2. Each img file must be processed by an exe; the img paths are read from a file.
    Which is better: storing the result of Get-ChildItem in a variable or in a file?

I would greatly appreciate your guidance and support in finding the best way to speed up the process while keeping resource consumption low. Thank you in advance!!
This is the code I am currently using:

$PSDefaultParameterValues['*:Encoding'] = 'Ascii'
$host.ui.RawUI.WindowTitle = "DICOM IMPORT IN PROGRESS"
#region SET WINDOW FIXED WIDTH
$pshost = get-host
$pswindow = $pshost.ui.rawui
$newsize = $pswindow.buffersize
$newsize.height = 3000
$newsize.width = 150
$pswindow.buffersize = $newsize
$newsize = $pswindow.windowsize
$newsize.height = 50
$newsize.width = 150
$pswindow.windowsize = $newsize
#endregion
#
$out = ("$pwd\log_{0:yyyyMMdd_HH.mm.ss}_import.txt" -f (Get-Date))
cls
"`n" | tee -FilePath  $out -Append
"*****************" | tee -FilePath  $out -Append
"**IMPORT SCRIPT**" | tee -FilePath  $out -Append
"*****************" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
# SET SEARCH FOLDERS #
"Working Folder" | tee -FilePath  $out -Append
$path1 = Read-Host "Enter folder location" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
#
# SET & SHOW HOSTNAME
"SERVER NAME" | tee -FilePath  $out -Append
$ht = hostname | tee -FilePath $out -Append
Write-Host $ht
Start-Sleep -Seconds 3
"`n" | tee -FilePath  $out -Append
#
#
# GET FILES
"`n" | tee -FilePath  $out -Append
#"SEARCHING IMG FILES, PLEASE WAIT..." | tee -FilePath  $out -Append
$files = $path1 | Get-ChildItem -recurse -file -filter *.img | ForEach-Object { $_.FullName }
# SHOW Get-ChildItem PROCESS ON CONSOLE
Out-host -InputObject $files 
"`n" | tee -FilePath  $out -Append
Write-Output ($files | Measure).Count "IMG FILES FOUND TO PUSH" | tee -FilePath  $out -Append
# DUMP Get-ChildItem output into a file
$files > $pwd\pf
Start-Sleep -Seconds 5

# TIMESTAMP
"`n" | tee -FilePath  $out -Append
"IMPORT START" | tee -FilePath  $out -Append
("{0:yyyy/MM/dd HH:mm:ss}" -f (Get-Date)) | tee -FilePath $out -Append
"********************************" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
#
#SET TOOL
$ir = $Env:folder_tool
$pt = "utils\tool.exe"
#
#PROCESSING FILES
$n = 1
$pe = foreach ($file in Get-Content $pwd\pf ) {
    $tb = (Get-Date -f HH:mm:ss) | tee -FilePath  $out -Append
    $fp = "$n. $file" | tee -FilePath  $out -Append
    #
    $ep = & $ir$pt -c $ht"FIR" -i $file | tee -FilePath  $out -Append
    $as = "`n" | tee -FilePath  $out -Append
    # PRINT CONSOLE IMG FILES PROCESS
    Write-Host $tb
    Write-Host $fp
    Out-host -InputObject $ep
    Write-Host $as
    $n++
}  
#
#TIMESTAMP
"********************************" | tee -FilePath  $out -Append
"IMPORT END" | tee -FilePath  $out -Append
("{0:yyyy/MM/dd HH:mm:ss}" -f (Get-Date)) | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append

CodePudding user response:

Which is better: to store the result of Get-ChildItem in a variable or a file?

If you're hoping to keep memory utilization low, the best solution is to not store them at all - simply consume the output from Get-ChildItem directly:

$pe = $path1 | Get-ChildItem -Recurse -File -Filter *.img | ForEach-Object {
    $file = $_.FullName
    $tb = (Get-Date -f HH:mm:ss) | tee -FilePath  $out -Append
    $fp = "$n. $file" | tee -FilePath  $out -Append
    #
    $ep = & $ir$pt -c $ht"FIR" -i $file | tee -FilePath  $out -Append
    $as = "`n" | tee -FilePath  $out -Append
    # PRINT CONSOLE IMG FILES PROCESS
    Write-Host $tb
    Write-Host $fp
    Out-host -InputObject $ep
    Write-Host $as
    $n++
}
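
To see the streaming behaviour in isolation, here is a minimal sketch. It assumes a throwaway folder under the temp directory ($demoRoot and the file names in it are made up for the demo and stand in for $path1), and it only prints each path where the real script would call the exe.

```powershell
# Build a small throwaway tree (hypothetical stand-in for $path1).
$demoRoot = Join-Path ([System.IO.Path]::GetTempPath()) ('imgdemo_' + [guid]::NewGuid().ToString('N'))
$subDir   = Join-Path $demoRoot 'sub'
$null = New-Item -ItemType Directory -Path $subDir -Force
$null = New-Item -ItemType File -Path (Join-Path $demoRoot 'a.img')
$null = New-Item -ItemType File -Path (Join-Path $demoRoot 'b.txt')
$null = New-Item -ItemType File -Path (Join-Path $subDir 'c.img')

# Stream the matches: each file object is handed to ForEach-Object
# through the pipeline, so the script never assigns the full list
# of paths to a variable or writes it to an intermediate file.
$n = 0
Get-ChildItem -Path $demoRoot -Recurse -File -Filter *.img | ForEach-Object {
    $n++
    Write-Host "$n. $($_.FullName)"   # the exe call would go here
}

# Clean up the demo tree.
Remove-Item -Path $demoRoot -Recurse -Force
```

The counter works because the ForEach-Object script block runs in the caller's scope, so $n++ updates the same $n that was set before the pipeline.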

CodePudding user response:

Try running the work in parallel with PoshRSJob. Replace the Start-Process call in Process-File with your own code, and note that the jobs have no access to the console. Process-File must return a string. Adjust $JobCount and $inData to suit.

The main idea is to load the whole file list into a ConcurrentQueue, start $JobCount background jobs, and wait for them to exit. Each job takes a value from the queue, passes it to Process-File, and repeats until the queue is empty.


NOTE: If you stop the script, the RS jobs will continue to run until they finish or PowerShell is closed. Use Get-RSJob | Stop-RSJob and Get-RSJob | Remove-RSJob to stop the background work.


Import-Module PoshRSJob

Function Process-File
{
    Param(
       [String]$FilePath
    )
    $process = Start-Process -FilePath 'ping.exe' -ArgumentList '-n 5 127.0.0.1' -PassThru -WindowStyle Hidden
    $process.WaitForExit();
    return "Processed $FilePath"
}

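# Leave two cores free, but always start at least one job
$JobCount = [Math]::Max(1, [Environment]::ProcessorCount - 2)
# Top-level files only; pass [System.IO.SearchOption]::AllDirectories as a third argument to recurse
$inData = [System.Collections.Concurrent.ConcurrentQueue[string]]::new(
    [System.IO.Directory]::EnumerateFiles('S:\SCRIPTS\FileTest', '*.img')
    )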
 
$JobScript = [scriptblock]{
    $inQueue = [System.Collections.Concurrent.ConcurrentQueue[string]]$args[0]
    $outBag = [System.Collections.Concurrent.ConcurrentBag[string]]$args[1]
    $currentItem = $null
    while($inQueue.TryDequeue([ref] $currentItem) -eq $true)
    {
        try
        {
            # Add result to OutBag
            $result = Process-File -FilePath $currentItem -EA Stop
            $outBag.Add( $result )
        }
        catch
        {
            # Catch error
            Write-Output $_.Exception.ToString()
        }
    }
}
 

 
$resultData = [System.Collections.Concurrent.ConcurrentBag[string]]::new()
 
$i_cur = $inData.Count
$i_max = $i_cur
 
# Start jobs
$jobs = @(1..$JobCount) | % { Start-RSJob -ScriptBlock $JobScript -ArgumentList @($inData, $resultData) -FunctionsToImport @('Process-File') }
 
# Wait queue to empty
while($i_cur -gt 0)
{
    Write-Progress -Activity 'Doing job' -Status "$($i_cur) left of $($i_max)" -PercentComplete (100 - ($i_cur / $i_max * 100)) 
    Start-Sleep -Seconds 3 # Update frequency
    $i_cur = $inData.Count
}
 
# Wait jobs to complete
$logs = $jobs | % { Wait-RSJob -Job $_ } | % { Receive-RSJob -Job $_  } 
$jobs | % { Remove-RSJob -Job $_ }
$Global:resultData = $resultData
$Global:logs = $logs

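$Global:resultData is a ConcurrentBag holding the strings returned by Process-File. $Global:logs holds whatever the jobs wrote to their output stream, which here is the text of any caught exception.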
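
The worker loop is easier to follow on its own. The sketch below is single-threaded and leaves PoshRSJob out entirely; it only shows how TryDequeue drains the queue and how results land in the bag. The file names and the "Processed" string are placeholders rather than real paths or real tool output.

```powershell
# Hypothetical file names standing in for the real .img paths.
$paths = 'a.img', 'b.img', 'c.img'

$queue = [System.Collections.Concurrent.ConcurrentQueue[string]]::new()
foreach ($p in $paths) { $queue.Enqueue($p) }

$results = [System.Collections.Concurrent.ConcurrentBag[string]]::new()

# TryDequeue returns $false once the queue is empty, which ends the loop.
# With several jobs running this same loop against one shared queue,
# each item is handed to exactly one job.
$item = $null
while ($queue.TryDequeue([ref] $item)) {
    $results.Add("Processed $item")   # Process-File would be called here
}

Write-Host "Left in queue: $($queue.Count); results collected: $($results.Count)"
```

Because both collections are thread-safe, the same loop body can run unchanged inside every background job; only the queue and bag instances need to be shared.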
