PowerShell: Efficient way to delete the first 10 rows of a HUGE text file


I need to delete the first couple of lines of a .txt file in PowerShell. There are already plenty of questions and answers on Stack Overflow covering how to do it. Most of them copy the whole file content into memory, cut out the first x lines and then save the content back into the text file. However, in my case the text files are huge (500MB+), so loading them completely into memory just to delete the first couple of lines takes very long and feels like a huge waste of resources.
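
For example, the typical pattern from those answers looks like this (it reads the entire file into memory before rewriting it):

(Get-Content in.txt | Select-Object -Skip 10) | Set-Content in.txt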

Is there a more elegant approach? If you only want to read the first x lines, you can use

Get-Content in.csv -Head 10

which reads only the first 10 lines. Is there something similar for deletion?

CodePudding user response:

Here is another way to do it using StreamReader and StreamWriter. As noted in the comments, it's important to know the encoding of your file for this use case.

See the Remarks section of the official documentation:

The StreamReader object attempts to detect the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

If you need to specify an encoding, you can target the StreamReader(String, Encoding) constructor. For example:

$reader = [System.IO.StreamReader]::new('path\to\input.csv', [System.Text.Encoding]::UTF8)

As noted previously in Remarks, this might not be needed for common encodings.

An alternative to the code below could be the use of $reader.ReadToEnd(), as Brice points out in his comment: after skipping the first 10 lines, this would read the entire remaining content of the file into memory before writing it to the new file. I haven't used that approach in this answer since mklement0's helpful answer already provides a very similar solution, and this answer is intended to be a memory-friendly one.
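
For reference, a rough sketch of that variant (the 10-line skip from the code below stays the same; only the copy loop is replaced):

# after skipping the first 10 lines, read and write the rest in one call
$writer.Write($reader.ReadToEnd())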

try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')

    # skip 10 lines
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }

    # stream the remaining lines to the output file
    while(-not $reader.EndOfStream) {
        $writer.WriteLine($reader.ReadLine())
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}

It's also worth noting zett42's helpful comment: using the $reader.ReadBlock(Char[], Int32, Int32) method could be an even faster and still memory-friendly alternative.
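
A rough sketch of that ReadBlock idea, assuming the same placeholder paths as above (copying the remainder in fixed-size character chunks rather than line by line also preserves the file's original newline format):

try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')

    # skip the first 10 lines
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }

    # copy the rest in 64 KiB character chunks
    $buffer = [char[]]::new(64KB)
    while(($read = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
        $writer.Write($buffer, 0, $read)
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}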

CodePudding user response:

You're essentially attempting to remove the starting bytes of the file without modifying the remaining bytes. Raymond Chen has a good read on why that can't be done:

The underlying abstract model for storage of file contents is in the form of a chunk of bytes, each indexed by the file offset. The reason appending bytes and truncating bytes is so easy is that doing so doesn’t alter the file offsets of any other bytes in the file. If a file has ten bytes and you append one more, the offsets of the first ten bytes stay the same. On the other hand, deleting bytes from the front or middle of a file means that all the bytes that came after the deleted bytes need to “slide down” to close up the space. And there is no “slide down” file system function.

CodePudding user response:

As Mike Anthony's helpful answer explains, there is no system-level function that efficiently implements what you're trying to do, so you have no choice but to rewrite your file.

While memory-intensive, the following solution is reasonably fast:

  • Read the file as a whole into memory, as a single string, using Get-Content's -Raw switch...

    • This is orders of magnitude faster than the line-by-line streaming that Get-Content performs by default.
  • ... then use regex processing to strip the first 10 lines ...

  • ... and save the trimmed content back to disk.

Important:

  • Since this rewrites the file in place, be sure to have a backup copy of your file.

  • Use -Encoding with Get-Content / Set-Content to correctly interpret the input / control the output character encoding (PowerShell fundamentally doesn't preserve information about the character encoding of a file read with Get-Content). Without -Encoding, the default is the system's active ANSI code page in Windows PowerShell and, more sensibly, BOM-less UTF-8 in PowerShell (Core) 7+.

# Use -Encoding as needed.
# The regex '^(?:.*\r?\n){10}' matches the first 10 lines (including their newlines).
(Get-Content -Raw in.csv) -replace '^(?:.*\r?\n){10}' | 
  Set-Content -NoNewLine in.csv

If the file is too large to fit into memory:

If you happen to have WSL installed, an efficient, streaming tail solution is possible:

Note:

  • Your input file must use a character encoding in which a LF character is represented as a single 0xA byte - which is true of most single-byte encodings and also of the variable-width UTF-8 encoding, but not of, say, UTF-16.

  • You must output to a different file (which you can later replace the input file with).

bash.exe -c 'tail -n +11 in.csv > out.csv'
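
Afterwards you can replace the input file with the result, e.g.:

Move-Item -Force out.csv in.csv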

Otherwise, line-by-line processing is required.

Caveat:

  • All line-by-line processing approaches risk inadvertently changing the newline format of the original file: on writing the lines back to a file, it is invariably the platform-native newline format that is used (CRLF on Windows, LF on Unix-like platforms).

  • Also, the information as to whether the input file had a trailing newline or not is lost.

Santiago's helpful answer shows a solution based on .NET APIs, which performs well by PowerShell standards.

For the sake of completeness, here's a comparatively slower, PowerShell-native solution using a switch statement with the -File parameter for fast line-by-line reading (much faster than Get-Content):

  & {
    $i = 0
    switch -File in.csv {
      default { if (++$i -ge 11) { $_ } }
    }
  } | Set-Content out.csv  # use -Encoding as needed

Note:

  • Since switch doesn't allow specifying a character encoding for the input file, this approach only works if the character encoding is correctly detected / assumed by default. While BOM-based files will be read correctly, note that switch makes different assumptions about BOM-less files based on the PowerShell edition: in Windows PowerShell, the system's active ANSI code page is assumed; in PowerShell (Core) 7+, it is UTF-8.

  • Because language statements cannot directly serve as pipeline input, the switch statement must be called via a script block (& { ... }).

  • Streaming the resulting lines to Set-Content via the pipeline is what slows the solution down. Passing the new file content as an argument to Set-Content's -Value parameter would drastically speed up the operation - but that would again require that the file fit into memory as a whole:

    # Faster reformulation, but the *input file must fit into memory as a whole*.
    # `switch` offers a lot of flexibility. If that isn't needed
    # and reading the file in full is acceptable, the
    # Get-Content -Raw solution at the top is the fastest PowerShell solution.
    Set-Content out.csv $(
      $i = 0
      switch -File in.csv {
        default { if (++$i -ge 11) { $_ } }
      }
    )
    

CodePudding user response:

There may be another alternative: use switch to read the files line by line while buffering a certain maximum number of lines in a List. This keeps memory consumption low and at the same time limits the number of disk writes, speeding up the process.

Something like this, perhaps:

$maxBuffer   = 10000  # the maximum number of lines to buffer
$linesBuffer = [System.Collections.Generic.List[string]]::new()

# get an array of the files you need to process
$files = Get-ChildItem -Path 'X:\path\to\the\input\files' -Filter '*.txt' -File
foreach ($file in $files) {
    # initialize a counter for omitting the first 10 lines and clear the buffer
    $omitCounter = 0 
    $linesBuffer.Clear()
    # create a new file path by appending '_New' to the input file's basename
    $outFile = '{0}\{1}_New{2}' -f $file.DirectoryName, $file.BaseName, $file.Extension

    switch -File $file.FullName {
        default {
            if ($omitCounter -ge 10) {
                if ($linesBuffer.Count -eq $maxBuffer) {
                    # write out the buffer to the new file and clear it for the next batch
                    Add-Content -Path $outFile -Value $linesBuffer
                    $linesBuffer.Clear()
                }
                $linesBuffer.Add($_)
            }
            else { $omitCounter++ }  # no output, just increment the counter
        }
    }
    # here, check if there is still some data left in the buffer
    if ($linesBuffer.Count) { Add-Content -Path $outFile -Value $linesBuffer }
}