Home > Back-end >  Slowness to Remove 3,7 and 9 column from | separated txt file using PowerShell
Slowness to Remove 3,7 and 9 column from | separated txt file using PowerShell

Time:06-12

I have Pipe separated data file with huge data and i want to remove 3,7, and 9 column. below script is working 100% fine. but its too slow its taking 5 mins for 22MB file.

Adeel|01|test|1234589|date|amount|00|123345678890|test|all|01| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|05|test|1234589|date|amount|00|123345678890|test|all|05| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|09|test|1234589|date|amount|00|123345678890|test|all|09| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00| Adeel|12|test|1234589|date|amount|00|123345678890|test|all|12|

    param
(
    # Input data file
    [string]$Path = 'O:\Temp\test.txt',
    # Columns to be removed, any order, dupes are allowed
    [int[]]$Remove = (3,6)
)

# sort indexes descending and remove dupes
$Remove = $Remove | Sort-Object -Unique -Descending

# read input lines
Get-Content $Path | .{process{
    # split and add to ArrayList which allows to remove items
    $list = [Collections.ArrayList]($_ -split '\|')

    # remove data at the indexes (from tail to head due to descending order)
    foreach($i in $Remove) {
        $list.RemoveAt($i)
    }

    # join and output
    #$list -join '|'
    $contentUpdate=$list -join '|'
    Add-Content "O:\Temp\testoutput.txt" $contentUpdate
}
}

CodePudding user response:

Get-Content is comparatively slow. Use of the pipeline adds additional overhead.

When performance matters, StreamReader and StreamWriter can be a better choice:

param (
    # Input data file
    [string] $InputPath = 'input.txt',
    # Output data file
    [string] $OutputPath = 'output.txt',
    # Columns to be removed, any order, dupes are allowed
    [int[]] $Remove = (1, 2, 2),
    # Column separator
    [string] $Separator = '|',
    # Input file encoding
    [Text.Encoding] $Encoding = [Text.Encoding]::UTF8
)

$ErrorActionPreference = 'Stop'

# Gets rid of dupes and provides fast lookup ability
$removeSet = [Collections.Generic.HashSet[int]] $Remove

$reader = $writer = $null

try {
    $reader = [IO.StreamReader]::new(( Convert-Path -LiteralPath $InputPath ), $encoding )

    $null = New-Item $OutputPath -ItemType File -Force  # as Convert-Path requires existing path

    while( $line = $reader.ReadLine() ) {

        if( -not $writer ) {
            # Construct writer only after first line has been read, so $reader.CurrentEncoding is available 
            $writer = [IO.StreamWriter]::new(( Convert-Path -LiteralPath $OutputPath ), $false, $reader.CurrentEncoding )
        }

        $columns = $line.Split( $separator )
        $isAppend = $false

        for( $i = 0; $i -lt $columns.Length; $i   ) {
            if( -not $removeSet.Contains( $i ) ) {
                if( $isAppend ) { $writer.Write( $separator ) }
                $writer.Write( $columns[ $i ] )
                $isAppend = $true
            }
        }

        $writer.WriteLine()  # Write (CR)LF
    }
}
finally {
    # Make sure to dispose the reader and writer so files get closed.
    if( $writer ) { $writer.Dispose() }
    if( $reader ) { $reader.Dispose() }
}
  • Convert-Path is used because .NET has a different current directory than PowerShell, so it's best practice to pass absolute paths to .NET API.
  • If this still isn't fast enough, consider writing this in C# instead. Especially with such "low level" code, C# tends to be faster. You may embed C# code in PowerShell using Add-Type -TypeDefinition $csCode.
  • As another optimization, instead of using String.Split() which creates more sub strings than actually needed, you may use String.IndexOf() and String.Substring() to only extract the necessary columns.
  • Last not least, you may experiment with StreamReader and StreamWriter constructors that lets you allow to specify a buffer size.
  • Related