Home > Software engineering >  Powershell - Faster way of getting rows from a csv file and outputting to a different csv file
Powershell - Faster way of getting rows from a csv file and outputting to a different csv file

Time:01-04

I have a csv file that contains approx 250,000 rows. The first of the headers on this csv file is "Campaign Ref Code". I also have a list of campaign reference codes that I need to look up in this first column (ie COLMABQ140, COLMABQ141). If I find this campaign ref code in the first column I want to output the information for this campaign code (ie address, date contacted etc) into a separate csv file. I have a number of these codes and need to produce a number of separate files specific to each separate campaign code. I have actually managed to get a working script for this BUT it takes over an hour to run and I'd like to find a quicker way of processing this file. Basically, I create an array of the campaign codes then loop through them using Import-csv & exporting to a separate file using Export-csv if I find a match (see a snippet of the code below - there are 20 different $refs arrays & corresponding for loops). Like I say, it does work perfectly well in as much as it gets the information I need in a format I specified but there has to be a quicker way of doing things...even if the original file is a quarter of a million rows!!

$file = 'Orig.csv'
$newfile = "File1.$today.csv"
        
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

    foreach ($ref in $refs) {

        Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append 
    
    }
        
$newfile = "File2.$today.csv"
        
$refs = @('COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180')

    foreach ($ref in $refs) {

        Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append 
    
    }
        
$newfile = "File3.$today.csv"
        
$refs = @('COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180')

    foreach ($ref in $refs) {

        Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append 
    
    }
            

CodePudding user response:

The performance issues with your script are:

  • You're iterating multiple times through the Orig.csv input file.
    It is faster to create a hashtable with your files and the specific $refs values to test for.
  • You're reopening multiple times the output file to append to it.
    It is faster to use (3) stepablepipelines for this, see: What good does a SteppablePipeline.
    Note that a correctly setup PowerShell pipeline had a low memory usage.
  • The -eq operator is probably slightly faster than the -like operator
    (apparently you don't need the -like operator as you don't have any wildcards in the values.)
    You might simply depend on the common comparison operator feature:

When the input of an operator is a scalar value, the operator returns a Boolean value. When the input is a collection, the operator returns the elements of the collection that match the right-hand value of the expression. If there are no matches in the collection, comparison operators return an empty array.

$workdir = ...
$refs =@{
    "File1.csv" = 'COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180'
    "File2.csv" = 'COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180'
    "File3.csv" = 'COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180'
}

$Pipelines = @{}
Import-csv .\Orig.csv |ForEach-Object -Begin {
    foreach ($file in $refs.keys) {
        $Pipelines[$file] = { Export-CSV -notype -Path "$workdir\$file" }.GetSteppablePipeline()
        $Pipelines[$file].Begin($file)
    }
} -Process {
    foreach ($file in $refs.keys) {
        if ($refs[$file] -eq $_.'Campaign Ref Code') { $Pipelines[$file].Process($_) }
    }
} -End {
    foreach ($file in $refs.keys) {
        $Pipelines[$file].End()
    }
}

CodePudding user response:

I've posted a comment above asking for clarification because your example code is a bit confusing. At this point I'm going to assume that $refs changes multiple times, but each day you run this you never need more than one $file and you're also always filtering on the Campaign Ref Code column.

The first big issue is that you're parsing and enumerating the file once for each Campaign Ref Code. That's going to be quite expensive. You can also speed up execution quite a bit by grouping the CSV file into a hash table, which will make lookups very fast. You're also appending to the same CSV file repeatedly, which is also fairly expensive.

# Load the file ONCE and group it by the Campaign Ref Code into a hashtable
$csv = Import-Csv $file | Group-Object -Property 'Campaign Ref Code' -AsHashTable

$newfile = "File1.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile


$newfile = "File2.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile


$newfile = "File3.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile

The real limitation of this method is that your $refs need to be precisely correct. Leading and trailing spaces need to match exactly, if there were any. You can't use globbing patterns like you could with the -like operator. It doesn't look like you're actually doing that, though, so it probably doesn't actually matter. It will also output the file in a different order than your method does. If you need them in the order of the $refs variable, you'll need to change it to something like the code below, but I am not sure that that is guaranteed to create precisely the same order.

$csv[$refs] | 
    Sort-Object -Property @{e={ $refs.IndexOf($_.'Campaign Ref Code') }} |
    Export-CSV -NoTypeInformation -Path $workdir\$newfile

Note that the Group-Object command's performance is improved in Powershell v5.1 compared to prior versions, and it's very significantly improved in Powershell v6 .

  • Related