I have a csv file that contains approx 250,000 rows. The first of the headers on this csv file is "Campaign Ref Code". I also have a list of campaign reference codes that I need to look up in this first column (ie COLMABQ140, COLMABQ141). If I find this campaign ref code in the first column I want to output the information for this campaign code (ie address, date contacted etc) into a separate csv file. I have a number of these codes and need to produce a number of separate files specific to each separate campaign code. I have actually managed to get a working script for this BUT it takes over an hour to run and I'd like to find a quicker way of processing this file. Basically, I create an array of the campaign codes then loop through them using Import-csv & exporting to a separate file using Export-csv if I find a match (see a snippet of the code below - there are 20 different $refs arrays & corresponding for loops). Like I say, it does work perfectly well in as much as it gets the information I need in a format I specified but there has to be a quicker way of doing things...even if the original file is a quarter of a million rows!!
$file = 'Orig.csv'
$newfile = "File1.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')
foreach ($ref in $refs) {
Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append
}
$newfile = "File2.$today.csv"
$refs = @('COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180')
foreach ($ref in $refs) {
Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append
}
$newfile = "File3.$today.csv"
$refs = @('COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180')
foreach ($ref in $refs) {
Import-csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-CSV -notype -Path $workdir\$newfile -Append
}
CodePudding user response:
The performance issues with your script are:
- You're iterating multiple times through the
Orig.csv
input file.
It is faster to create a hashtable with your files and the specific$refs
values to test for. - You're reopening multiple times the output file to append to it.
It is faster to use (3) stepablepipelines for this, see: What good does a SteppablePipeline.
Note that a correctly setup PowerShell pipeline had a low memory usage. - The
-eq
operator is probably slightly faster than the-like
operator
(apparently you don't need the-like
operator as you don't have any wildcards in the values.)
You might simply depend on the common comparison operator feature:
When the input of an operator is a scalar value, the operator returns a Boolean value. When the input is a collection, the operator returns the elements of the collection that match the right-hand value of the expression. If there are no matches in the collection, comparison operators return an empty array.
$workdir = ...
$refs =@{
"File1.csv" = 'COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180'
"File2.csv" = 'COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180'
"File3.csv" = 'COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180'
}
$Pipelines = @{}
Import-csv .\Orig.csv |ForEach-Object -Begin {
foreach ($file in $refs.keys) {
$Pipelines[$file] = { Export-CSV -notype -Path "$workdir\$file" }.GetSteppablePipeline()
$Pipelines[$file].Begin($file)
}
} -Process {
foreach ($file in $refs.keys) {
if ($refs[$file] -eq $_.'Campaign Ref Code') { $Pipelines[$file].Process($_) }
}
} -End {
foreach ($file in $refs.keys) {
$Pipelines[$file].End()
}
}
CodePudding user response:
I've posted a comment above asking for clarification because your example code is a bit confusing. At this point I'm going to assume that $refs
changes multiple times, but each day you run this you never need more than one $file
and you're also always filtering on the Campaign Ref Code
column.
The first big issue is that you're parsing and enumerating the file once for each Campaign Ref Code
. That's going to be quite expensive. You can also speed up execution quite a bit by grouping the CSV file into a hash table, which will make lookups very fast. You're also appending to the same CSV file repeatedly, which is also fairly expensive.
# Load the file ONCE and group it by the Campaign Ref Code into a hashtable
$csv = Import-Csv $file | Group-Object -Property 'Campaign Ref Code' -AsHashTable
$newfile = "File1.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')
$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile
$newfile = "File2.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')
$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile
$newfile = "File3.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')
$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile
The real limitation of this method is that your $refs
need to be precisely correct. Leading and trailing spaces need to match exactly, if there were any. You can't use globbing patterns like you could with the -like
operator. It doesn't look like you're actually doing that, though, so it probably doesn't actually matter. It will also output the file in a different order than your method does. If you need them in the order of the $refs
variable, you'll need to change it to something like the code below, but I am not sure that that is guaranteed to create precisely the same order.
$csv[$refs] |
Sort-Object -Property @{e={ $refs.IndexOf($_.'Campaign Ref Code') }} |
Export-CSV -NoTypeInformation -Path $workdir\$newfile
Note that the Group-Object
command's performance is improved in Powershell v5.1 compared to prior versions, and it's very significantly improved in Powershell v6 .