PowerShell IO large files without loading everything to memory


I have a scenario where I need to edit very large files, and while the end result is rather simple, getting there has become a drag on my computer and its memory. Because of a downstream system, I cannot load a file whose computed hash matches one that has already been loaded. My workaround is to move the first actual data record to the end of the file without changing anything else, which changes the hash while preserving the records. This approach (Method 1 below) works great for files that are small enough, but now I have files that are extremely large, so I began working on Method 2 below. I haven't quite figured out how to stream lines from an input file into an output file, though.

#Method 1
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
$files = (Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File)
Foreach ($inputFile in $files){
    $A = Get-Content $inputFile  #reads the entire file into memory at once
    $Header = $A[0]              #first line: the header
    $Footer = $A[1]              #second line: the first record, which moves to the end
    $Data = $A[2..($A.Count-1)]  #everything else, unchanged
    $Header, $Data, $Footer | Add-Content -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv"
}
#Work-in-progress Method 2
$inputFile = "\Input.csv"
$outputFile = "\Output.csv"

#Create StringReader
$sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))

#Create StringWriter
$sw = [System.IO.StringWriter]::New()

#Write the Header (WriteLine, not Write, so the header ends with a newline)
$sw.WriteLine($sr.ReadLine())

#Get the first actual record as a string
$lastLine = $sr.ReadLine()

#Write the rest of the lines
$sw.Write($sr.ReadToEnd())

#Add the final line
$sw.Write($lastLine)

#Write everything to the outputFile
[System.IO.File]::WriteAllText($outputFile, $sw.ToString())

Get-Content:
Line |
   5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
     |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Insufficient memory to continue the execution of the program.
MethodInvocationException:
Line |
   5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
     |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Exception calling ".ctor" with "1" argument(s): "Value cannot be null. (Parameter 's')"

I'm having a bit of trouble comprehending the difference between a StringWriter and a StringBuilder, for example: why would I choose to use the StringWriter as I have rather than simply work with a StringBuilder directly? Most importantly though, the current iteration of Method 2 requires more memory than my system has, and it isn't actually streaming the characters/lines/data from the input file to the output file. Are there built-in methods for checking memory that I'm overlooking, or is there simply a better way to achieve my goal?
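
For context on the first part of that question: from what I can tell, StringWriter is essentially a TextWriter facade over an in-memory StringBuilder, so neither of them streams anything to disk. A throwaway sketch of what I mean:

#StringWriter writes into a StringBuilder, which you can even supply yourself
$sb = [System.Text.StringBuilder]::new()
$sw = [System.IO.StringWriter]::new($sb)
$sw.WriteLine('Header')
$sw.ToString() -eq $sb.ToString()                       #True - same buffer contents
[object]::ReferenceEquals($sw.GetStringBuilder(), $sb)  #True - the very same object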

CodePudding user response:

The nice thing about the PowerShell pipeline is that it streams by nature, provided it is used correctly, meaning:

  • do not assign the pipeline results to a variable, and
  • do not wrap the pipeline in parentheses,

as either of those forces the complete output to be collected in memory first and chokes the streaming.
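
A contrived illustration (big.csv stands in for any large file):

#Streams: lines flow through one at a time, so memory use stays flat
Get-Content .\big.csv | Select-Object -First 5

#Chokes: both of these collect every line in memory before continuing
$lines = Get-Content .\big.csv                      #assignment buffers all output
(Get-Content .\big.csv) | Select-Object -First 5    #parentheses force full enumeration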

In your case:

$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    $file = $_
    Import-Csv -LiteralPath $file.FullName |
        ForEach-Object -Begin { $Index = 0 } -Process {
            if ($Index++) { $_ } else { $Footer = $_ }  #hold back the first record
        } -End { $Footer } |                            #then emit it last
        Export-Csv -LiteralPath "$($file.DirectoryName)\$Prefix$($file.BaseName).csv" -NoTypeInformation
}

Note that Import-Csv/Export-Csv round-trip the data through PowerShell's CSV parser, so quoting can change (Windows PowerShell's Export-Csv quotes every field), even though the records themselves are preserved.

CodePudding user response:

This is how your code would look using StreamReader and StreamWriter:

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    try {
        $path   = "$($_.DirectoryName)\$Prefix$($_.BaseName).csv"
        $writer = [IO.StreamWriter] $path
        $stream = $_.OpenRead()
        $reader = [IO.StreamReader] $stream
        $header = $reader.ReadLine()
        $writer.WriteLine($header)          #the header stays on top
        $firstRecord = $reader.ReadLine()   #hold back the first record
        while(-not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
        }
        $writer.WriteLine($firstRecord)     #append it as the new last line
    }
    finally {
        $stream, $reader, $writer | ForEach-Object Dispose
    }
}

This method keeps memory usage as low as possible (only one line is held in memory at a time) and is about as efficient as it gets.
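
As a quick sanity check (file names here are hypothetical), the rewritten file should now hash differently from the original, which is what the deduplicating downstream system cares about:

#Compare the original file with its rewritten counterpart
Get-FileHash .\Input.csv, .\1_Input.csv -Algorithm SHA256 | Format-Table Hash, Path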
