I have a scenario where I need to edit very large files and the end result is rather simple, but achieving it has become a bit of a drag on my computer and memory. Due to downstream systems, I cannot load a duplicate file (according to a computed hash) twice. My workaround has been to move the first actual line/record to the end of the file without changing anything else. This method (shown below in Method 1) works great for files that are small enough, but now I have files that are extremely large. So I began working on Method 2 below, but I haven't quite figured out how to stream lines from an input file into an output file.
#Method 1
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
$files = Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File
Foreach ($inputFile in $files){
    $A = Get-Content $inputFile            # reads the entire file into memory
    $Header = $A[0]                        # the header line stays first
    $Footer = $A[1]                        # the first record, which moves to the end
    $Data = $A[2..($A.Count-1)]            # everything else, unchanged
    $Header, $Data, $Footer | Add-Content -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv"
}
#Work-in-progress Method 2
$inputFile = "\Input.csv"
$outputFile = "\Output.csv"
#Create StringReader
$sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
#Create StringWriter
$sw = [System.IO.StringWriter]::New()
#Write the Header
$sw.Write($sr.ReadLine())
#Get the first actual record as a string
$lastLine = $sr.ReadLine()
#Write the rest of the lines
$sw.Write($sr.ReadToEnd())
#Add the final line
$sw.Write($lastLine)
#Write everything to the outputFile
[System.IO.File]::WriteAllText($outputFile, $sw.ToString())
Get-Content:
Line |
5 | $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Insufficient memory to continue the execution of the program.
MethodInvocationException:
Line |
5 | $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Exception calling ".ctor" with "1" argument(s): "Value cannot be null. (Parameter 's')"
The main problem is that the current iteration of Method 2 requires more memory than my system has, and it isn't actually streaming the characters/lines/data from the input file to the output file. Are there built-in methods for checking memory that I'm overlooking, or is there simply a better way to achieve my goal? As a side question, I'm having a bit of trouble comprehending the difference between a StringWriter and a StringBuilder; why would I choose to use the StringWriter as I have rather than simply work with a StringBuilder directly?
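From what I can tell, a StringWriter seems to be little more than a TextWriter facade over an internal StringBuilder; this little test (not part of my actual script, just me poking at the types) is what prompted the question:
$sb = [System.Text.StringBuilder]::new()
$sw = [System.IO.StringWriter]::new($sb)                 # this overload writes into the given builder
$sw.WriteLine('hello')
$sb.ToString()                                           # the text landed in the StringBuilder
[object]::ReferenceEquals($sb, $sw.GetStringBuilder())   # True - same underlying object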
CodePudding user response:
The nice thing about the PowerShell pipeline is that it streams by nature,
provided it is used correctly, meaning:
- Do not assign the pipeline results to a variable, and
- Do not use parentheses around a pipeline segment,
as either one chokes the pipeline by forcing the complete results to be collected in memory before the next command runs.
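For example, compare (using a hypothetical big.csv):
# Streams: lines flow through the pipeline one at a time, memory stays flat
Get-Content .\big.csv | Select-Object -First 5
# Chokes: the parentheses (or an assignment) read the whole file into an array first
(Get-Content .\big.csv) | Select-Object -First 5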
In your case:
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    $inputFile = $_
    Import-Csv -LiteralPath $inputFile.FullName |
        ForEach-Object -Begin { $Index = 0 } -Process {
            if ($Index++) { $_ } else { $Footer = $_ }   # hold back the first record...
        } -End { $Footer } |                             # ...and emit it last
        Export-Csv -NoTypeInformation -LiteralPath "$($inputFile.DirectoryName)\$Prefix$($inputFile.BaseName).csv"
}
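Note that Import-Csv/Export-Csv rewrite the records rather than copying the raw text (fields typically get quoted, for instance), which should be fine here since the whole point is to end up with a file that hashes differently. You can verify that with Get-FileHash (file names hypothetical):
Get-FileHash -LiteralPath .\Input.csv, .\1_Input.csv -Algorithm SHA256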
CodePudding user response:
This is how your code would look using StreamReader and StreamWriter:
Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    try {
        $path   = "$($_.DirectoryName)\$Prefix$($_.BaseName).csv"
        $writer = [IO.StreamWriter] $path
        $stream = $_.OpenRead()
        $reader = [IO.StreamReader] $stream
        $writer.WriteLine($reader.ReadLine())    # the header stays first
        $firstRecord = $reader.ReadLine()        # hold back the first record
        while(-not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
        }
        $writer.WriteLine($firstRecord)          # append it as the new last line
    }
    finally {
        $stream, $reader, $writer | ForEach-Object Dispose
    }
}
This method will keep memory usage as low as possible, since only one line is held in memory at a time, and it is about as efficient as this gets in PowerShell.
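As for built-in ways to check memory: nothing file-specific, but you can poll the process itself while the script runs, e.g.:
[System.GC]::GetTotalMemory($false) / 1MB    # managed heap currently allocated, in MB
(Get-Process -Id $PID).WorkingSet64 / 1MB    # this process's working set, in MB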