I have a large pipe-separated data file and I want to remove columns 3, 7, and 9. The script below works fine, but it is too slow: it takes about 5 minutes for a 22 MB file.
Sample data:
Adeel|01|test|1234589|date|amount|00|123345678890|test|all|01|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|05|test|1234589|date|amount|00|123345678890|test|all|05|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|09|test|1234589|date|amount|00|123345678890|test|all|09|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|00|test|1234589|date|amount|00|123345678890|test|all|00|
Adeel|12|test|1234589|date|amount|00|123345678890|test|all|12|
param
(
    # Input data file
    [string]$Path = 'O:\Temp\test.txt',
    # Columns to be removed, any order, dupes are allowed
    [int[]]$Remove = (3,6)
)

# sort indexes descending and remove dupes
$Remove = $Remove | Sort-Object -Unique -Descending

# read input lines
Get-Content $Path | .{process{
    # split and add to ArrayList which allows to remove items
    $list = [Collections.ArrayList]($_ -split '\|')
    # remove data at the indexes (from tail to head due to descending order)
    foreach($i in $Remove) {
        $list.RemoveAt($i)
    }
    # join and output
    #$list -join '|'
    $contentUpdate = $list -join '|'
    Add-Content "O:\Temp\testoutput.txt" $contentUpdate
}}
CodePudding user response:
Get-Content is comparatively slow, and use of the pipeline adds additional overhead. When performance matters, StreamReader and StreamWriter can be a better choice:
param (
    # Input data file
    [string] $InputPath = 'input.txt',
    # Output data file
    [string] $OutputPath = 'output.txt',
    # Columns to be removed, any order, dupes are allowed
    [int[]] $Remove = (1, 2, 2),
    # Column separator
    [string] $Separator = '|',
    # Input file encoding
    [Text.Encoding] $Encoding = [Text.Encoding]::UTF8
)

$ErrorActionPreference = 'Stop'

# Gets rid of dupes and provides fast lookup ability
$removeSet = [Collections.Generic.HashSet[int]] $Remove

$reader = $writer = $null
try {
    $reader = [IO.StreamReader]::new(( Convert-Path -LiteralPath $InputPath ), $Encoding )
    $null = New-Item $OutputPath -ItemType File -Force  # as Convert-Path requires existing path

    # Compare against $null so empty lines don't end the loop early
    while( $null -ne ($line = $reader.ReadLine()) ) {
        if( -not $writer ) {
            # Construct writer only after first line has been read, so $reader.CurrentEncoding is available
            $writer = [IO.StreamWriter]::new(( Convert-Path -LiteralPath $OutputPath ), $false, $reader.CurrentEncoding )
        }

        $columns = $line.Split( $Separator )
        $isAppend = $false
        for( $i = 0; $i -lt $columns.Length; $i++ ) {
            if( -not $removeSet.Contains( $i ) ) {
                if( $isAppend ) { $writer.Write( $Separator ) }
                $writer.Write( $columns[ $i ] )
                $isAppend = $true
            }
        }
        $writer.WriteLine()  # Write (CR)LF
    }
}
finally {
    # Make sure to dispose the reader and writer so files get closed.
    if( $writer ) { $writer.Dispose() }
    if( $reader ) { $reader.Dispose() }
}
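
For reference, a hypothetical invocation of the script above (the file name Remove-Columns.ps1 is only an example). Note that the Remove indexes are zero-based, so the 3rd, 7th and 9th columns from the question correspond to indexes 2, 6 and 8:

# Hypothetical example: the answer's script saved as Remove-Columns.ps1.
# Indexes are zero-based: columns 3, 7 and 9 (counted from 1) map to 2, 6 and 8.
.\Remove-Columns.ps1 -InputPath 'O:\Temp\test.txt' -OutputPath 'O:\Temp\testoutput.txt' -Remove 2, 6, 8 -Separator '|'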
- Convert-Path is used because .NET has a different current directory than PowerShell, so it's best practice to pass absolute paths to the .NET API.
- If this still isn't fast enough, consider writing this in C# instead. Especially with such "low level" code, C# tends to be faster. You may embed C# code in PowerShell using Add-Type -TypeDefinition $csCode.
- As another optimization, instead of using String.Split(), which creates more substrings than actually needed, you may use String.IndexOf() and String.Substring() to extract only the necessary columns (see the sketch after this list).
- Last but not least, you may experiment with the StreamReader and StreamWriter constructors that let you specify a buffer size.
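
As a rough illustration of the last two points, here is a minimal sketch, not the answer's actual code. It assumes $removeSet, $writer and $Separator exist as in the script above; the function name, the placeholder paths and the 64 KB buffer size are made up for the example:

# Sketch: emit one line using IndexOf/Substring instead of Split, so substrings
# are only allocated for columns that are actually kept.
# Assumes $removeSet, $writer and $Separator from the script above.
function Write-KeptColumns( [string] $line ) {
    $start = 0          # start position of the current column
    $index = 0          # zero-based column index
    $isAppend = $false
    while( $start -le $line.Length ) {
        $sep = $line.IndexOf( $Separator, $start )
        $end = if( $sep -ge 0 ) { $sep } else { $line.Length }
        if( -not $removeSet.Contains( $index ) ) {
            if( $isAppend ) { $writer.Write( $Separator ) }
            $writer.Write( $line.Substring( $start, $end - $start ) )
            $isAppend = $true
        }
        $start = $end + $Separator.Length
        $index++
        if( $sep -lt 0 ) { break }   # last column reached
    }
    $writer.WriteLine()
}

# StreamReader/StreamWriter constructors that take an explicit buffer size
# ($inPath/$outPath are placeholder absolute paths; 64 KB is an arbitrary value to experiment with):
$reader = [IO.StreamReader]::new( $inPath, [Text.Encoding]::UTF8, $true, 65536 )
$writer = [IO.StreamWriter]::new( $outPath, $false, [Text.Encoding]::UTF8, 65536 )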