Replace the first line of a humongous csv file


What is the most efficient way to replace the first line of a HUMONGOUS CSV file? The files I am working with are nearly 40 GB in size. I am currently reading in the header row, cleaning it up, and writing it back out. This is what I am currently doing in PowerShell when writing back to the file:

Import-Csv $file -Header $newHeader | Select-Object -Skip 1 | Export-Csv -Path $newFile -NoTypeInformation

I am wondering if there is a way for me to replace that first line, without reading the entire file into memory, then writing it to a new file.

UPDATE

I have modified my script to read/write using the following:

$sr = [System.IO.StreamReader] $inputFile
$sw = [System.IO.StreamWriter] $outputFile

# Get rid of the first row
$null = $sr.ReadLine()

$header = '...' # my new header
$sw.WriteLine($header)

# Write the remainder of the old file to the new report file
# (compare against $null, otherwise an empty line ends the loop early)
while ($null -ne ($line = $sr.ReadLine())) {
    $sw.WriteLine($line)
}

$sr.Dispose()
$sw.Dispose()

CodePudding user response:

This was initially proposed by zett42 in a helpful comment some time ago, and I think it is the best balance between speed and memory efficiency in this case. The concept is to read and write in chunks.

It is important to understand that this may bring encoding issues, as I explained in the linked answer; you need to test for yourself whether the code works correctly as-is or whether you need to change the encoding (how to read and write using a specific encoding is also explained in that answer).
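For example, if the file turns out not to be UTF-8, the reader and writer can be constructed with an explicit encoding instead. A minimal sketch, assuming (purely for illustration) a Windows-1252 input and a BOM-less UTF-8 output; match both to your actual data:

using namespace System.IO
using namespace System.Text

# assumed encodings -- change both to whatever your file really uses
$readEncoding  = [Encoding]::GetEncoding(1252)   # Windows-1252 ("ANSI")
$writeEncoding = [UTF8Encoding]::new($false)     # UTF-8 without a BOM

$reader = [StreamReader]::new('absolute\path\to\input.csv', $readEncoding)
$writer = [StreamWriter]::new('absolute\path\to\output.csv', $false, $writeEncoding)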


Using 4 KB as the recommended buffer size, based on these excellent answers:


using namespace System.IO
using namespace System.Text

try {
    $sr = [File]::OpenRead('absolute\path\to\input.csv')
    # Create (unlike OpenWrite) truncates the output file if it already exists
    $sw = [File]::Create('absolute\path\to\output.csv')
    $buffer = [char[]]::new(4kb)
    $reader = [StreamReader]::new($sr, [Encoding]::UTF8, $true, $buffer.Length)
    # (Stream, Encoding, Int32 bufferSize) is the correct StreamWriter overload
    $writer = [StreamWriter]::new($sw, [Encoding]::UTF8, $buffer.Length)

    # ignore the first line (headers)
    $null = $reader.ReadLine()
    # define the new headers
    $header = '"ColumnA","ColumnB","ColumnC"'
    # and write them to the first line of our output file
    $writer.WriteLine($header)

    # read and write the rest of the file in chunks of `$buffer` size
    while (($len = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
        $writer.Write($buffer, 0, $len)
    }
}
finally {
    # flush and dispose the streams
    $writer.ForEach('Flush')
    ($reader, $writer, $sr, $sw).ForEach('Dispose')
}

CodePudding user response:

You could use a very simple script. It isn't proven to be the fastest (though I've had much better performance from it than from other commands), but it reads the file one line at a time, which drastically reduces memory usage and at least lets the task complete.

$count = 1
foreach ($line in [System.IO.File]::ReadLines('C:\in.csv')) {
    # skip the first (header) line
    if ($count -eq 1) {
        $count++
        continue
    }
    $line | Out-File C:\out.csv -Append
}

You could also batch the writes, continually incrementing the count and saving every 1000 lines, to avoid one Out-File call per line. This might optimize it a bit, as sketched below.
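A rough sketch of that batching idea (untested; the file paths and the 1000-line batch size are placeholders):

$count = 0
$batch = [System.Collections.Generic.List[string]]::new()

foreach ($line in [System.IO.File]::ReadLines('C:\in.csv')) {
    $count++
    if ($count -eq 1) { continue }   # still skipping the header row
    $batch.Add($line)
    if ($batch.Count -ge 1000) {
        # one Out-File call per 1000 lines instead of one per line
        $batch | Out-File C:\out.csv -Append
        $batch.Clear()
    }
}
# write whatever is left over after the loop
if ($batch.Count) { $batch | Out-File C:\out.csv -Append }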

If you're just trying to change the first line, something like Notepad might be able to partially open the file so you can edit that line and save it. (Such an editor doesn't load the entire file, only the part you're looking at plus a few pages.)
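One caveat on editing in place: without rewriting the rest of the file, the replacement header must occupy exactly the same number of bytes as the original, because the remaining gigabytes cannot be shifted. A hypothetical sketch of that trick, assuming a single-byte encoding and padding the new header with spaces to match (path and header are placeholders, and the padding ends up inside the header line):

$path      = 'C:\huge.csv'                     # placeholder path
$newHeader = '"ColumnA","ColumnB","ColumnC"'   # placeholder header

$fs = [System.IO.File]::Open($path, 'Open', 'ReadWrite')
try {
    # count the bytes of the first line by scanning for LF (0x0A)
    $headerBytes = 0
    while (($b = $fs.ReadByte()) -ne -1 -and $b -ne 0x0A) { $headerBytes++ }

    # pad with spaces so the new header fills exactly the same bytes
    $bytes = [System.Text.Encoding]::ASCII.GetBytes($newHeader.PadRight($headerBytes))
    if ($bytes.Length -ne $headerBytes) { throw 'New header is longer than the old one.' }

    # overwrite the old header in place
    $fs.Position = 0
    $fs.Write($bytes, 0, $bytes.Length)
}
finally {
    $fs.Dispose()
}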

Another option is to use Linux; WSL might work. Its file-editing tools blow PowerShell out of the water.

Use tools like awk or sed, or just get a line count and tail the last (length - 1) lines of the file. Trust me, it's fast. To put it in perspective: PowerShell took 15 minutes to deal with a 150 MB file, and the Linux tools did it in 2 or 3 seconds.
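For instance, a sketch of that route in the shell (header and file names are placeholders; note this still streams a full copy to a new file, just a very fast one):

# write the replacement header, then everything after line 1
printf '%s\n' '"ColumnA","ColumnB","ColumnC"' > out.csv
tail -n +2 in.csv >> out.csv

# or do the replacement in a single sed pass
sed '1s/.*/"ColumnA","ColumnB","ColumnC"/' in.csv > out.csv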
