How to Modify Large CSV with PowerShell without using all Server Memory

I am using PowerShell to do some data modification on CSV files before importing them into Oracle. I have watched Resource Monitor while the process is running, and the process chews up all 20 GB of available memory on the server. One of my CSVs is roughly 90 MB, with almost 200 columns and 100K rows; the generated CSV is about 120 MB. Here is the code I am currently using:

# Process Configuration File
$path = $PSScriptRoot + "\"

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
   # Convert Date Time values that are always populated
   % {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |

   # Convert DateTime values that may be blank
   % {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
   % {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |

   # Fix Multi-Select Picklist fields, replacing value divider with "|"
   % {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |

   # Trim values that can be too long
   % {$_.product__v = ($_.product__v)[0..254] -join "";$_} |

   # Add ExtractDate Column
   Select-Object *,@{Name='Extract_Date';Expression={$date}} |

   #Export Results
   Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

Is there a more efficient way to modify large CSV files with PowerShell than what I am currently doing? The process takes roughly 10 minutes to complete. I am by no means a PowerShell guru and built my scripts based on info from this site and the MS PowerShell documentation. Any suggestions would be greatly appreciated.

Here is data to create a sample documents.csv with a single record:

allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,

CodePudding user response:

PowerShell's Import-Csv cmdlet is a known memory hog, primarily due to the high memory requirements of the [pscustomobject] instances that it constructs - see GitHub issue #7603.
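
To gauge that overhead on your own data, you can compare the managed heap before and after an import. Here's a rough, minimal sketch (the numbers are approximate, and documents.csv stands in for your file):

# Rough measurement of the memory consumed by Import-Csv's [pscustomobject]s.
$before = [GC]::GetTotalMemory($true)   # force a full collection, then sample

$rows = Import-Csv "$PSScriptRoot\documents.csv" -Encoding UTF8

$after = [GC]::GetTotalMemory($true)
'{0:N0} bytes of managed heap for {1} rows' -f ($after - $before), $rows.Count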

There are several mitigation strategies, in ascending order of complexity:

  • In your ForEach-Object (%) script block (you should combine your separate % calls into one), force garbage collection every, say, 1000 objects to relieve memory pressure.

    • As Santiago Squarzon points out, the inefficient implementation of ForEach-Object - as of PowerShell 7.2.x, see GitHub issue #10982 - exacerbates the problem, both with respect to memory consumption and runtime.

    • See the code below, which combines periodic garbage collection with . { process { ... } } as a faster and memory-friendlier alternative to ForEach-Object.[1]

  • Use a custom PowerShell class to represent your CSV rows (see the first sketch after this list), but note that this will increase execution time.

    • See this answer for an example.

    • GitHub issue #8862 proposes building this functionality into Import-Csv, so as to make it construct instances of a given type in lieu of [pscustomobject]s to begin with.

  • If the above approaches turn out to be too slow, you'll need to resort to a third-party .NET parser library such as CSVHelper (see the second sketch after this list).

    • See this blog post with comparative benchmarks with links to many libraries, and the answers to this SO question (focused on C#).

    • Unfortunately, using general-purpose .NET NuGet packages is cumbersome in PowerShell as of v7.2.x. This answer shows what is currently needed. GitHub issue #6724 asks for a future Add-Type improvement for direct support of NuGet packages.
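
Here's a minimal sketch of the class-based approach. The class name CsvRow is hypothetical, and only three of the real columns are shown; a real implementation would declare one property per column in documents.csv:

class CsvRow {
  [string] $id
  [string] $name__v
  [string] $product__v
  # ... one [string] property per remaining CSV column
}

Import-Csv "$PSScriptRoot\documents.csv" -Encoding UTF8 | . {
  process {
    $row = [CsvRow]::new()
    # Copy just the properties the class declares; the [pscustomobject]
    # emitted by Import-Csv then becomes eligible for garbage collection.
    foreach ($prop in $row.psobject.Properties) {
      $row.($prop.Name) = $_.($prop.Name)
    }
    $row   # output the compact class instance instead of the [pscustomobject]
  }
}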

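And here's a hedged sketch of what driving CsvHelper from PowerShell could look like, assuming you've already placed CsvHelper.dll (plus any dependencies your PowerShell version needs) next to the script; product__v is one of the file's real columns:

Add-Type -Path "$PSScriptRoot\CsvHelper.dll"   # assumes the DLL was downloaded from NuGet

$reader = [System.IO.StreamReader]::new("$PSScriptRoot\documents.csv")
$csv = [CsvHelper.CsvReader]::new($reader, [System.Globalization.CultureInfo]::InvariantCulture)

$null = $csv.Read()        # advance to the header row
$null = $csv.ReadHeader()  # register the column names

while ($csv.Read()) {
  # GetField() returns the raw string value of the named column for the current row.
  $product = $csv.GetField('product__v')
  # ... transform and emit the row here ...
}

$csv.Dispose()
$reader.Dispose()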

Here's a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:

# Process Configuration File
$path = $PSScriptRoot + '\'

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
    process {

      # Perform garbage collection every 1000 objects 
      # in order to relieve memory pressure.
      if (++$i % 1000 -eq 0) { [GC]::Collect() }

      # Convert Date Time values that are always populated
      $_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')

      # Convert DateTime values that may be blank
      if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
      if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }

      # Fix Multi-Select Picklist fields, replacing value divider with "|"
      $_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')

      # Trim values that can be too long
      $_.product__v = ($_.product__v)[0..254] -join ""

      # Finally add an ExtractDate Column and output the modified object
      # (-PassThru) - this obviates the need for a separate Select-Object call.
      Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
    }
  } |
  Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

[1] Note that the variation & { process { ... } }, i.e. execution in a child scope, can speed up execution (see this answer for an explanation), but again increases memory consumption, which is why it isn't used here.

CodePudding user response:

@mklement0 makes a bunch of great points. Here's one more to add:

When It Comes to Big Files, You Might Need to Stream

Import-Csv and Export-Csv can be memory intensive because they load the entire file into memory. When you're dealing with big enough files, you may want to consider reading via a stream.

For most file formats I'd say tread carefully (or use someone else's class that reads from a stream). CSV, however, is famously simple.

So, here's a proof of concept of reading CSVs via a stream.

if (-not $csvStream) {
    $csvStream = [IO.File]::Open("$psScriptRoot\big.csv", 'Open', 'Read')
}

if (-not $csvReader) {
    $csvReader = [io.streamreader]::new($csvStream, $true)   
}

$csvSplitter = [Regex]::New('","')
$csvTrimmer  = [Regex]::new('(?>^"|"$)')
$firstLine = $csvReader.ReadLine()
$columnNames = $csvSplitter.Split($csvTrimmer.Replace($firstLine, ''))

$restOfLines = @(
while ($csvStream.Position -lt $csvStream.Length) {
    $csvReader.ReadLine()
})

$csvReader.Close()
$csvStream.Close()
$csvStream = $null
$csvReader = $null

$rowObjects = 
    foreach ($line in $restOfLines) {
        $row = [Ordered]@{}
        $splitLine = $csvSplitter.Split($csvTrimmer.Replace($line, ''))
        for ($columnNumber = 0; $columnNumber -lt $splitLine.Length; $columnNumber++) {
            $row[$columnNames[$columnNumber]] = $splitLine[$columnNumber]
        }
        [PSCustomObject]$row
    }

I may come back to this later, or produce a function based off of this, but this approach shows you how simple it is to read CSVs directly in PowerShell.

Tested against my 10 MB "big" CSV, it took less than a second to read everything in, and ~3 seconds to read everything in and convert it to PSObjects.

More pertinently, if you used the object pipeline while streaming, you could ensure that only the current row would be in memory.
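
For instance, here's a minimal sketch of a fully streaming variant of the proof of concept above (same simplifying assumptions about quoting), which emits each row the moment it is read instead of collecting $restOfLines first:

$csvReader   = [IO.StreamReader]::new("$PSScriptRoot\big.csv")
$csvSplitter = [Regex]::new('","')
$csvTrimmer  = [Regex]::new('(?>^"|"$)')
$columnNames = $csvSplitter.Split($csvTrimmer.Replace($csvReader.ReadLine(), ''))

& {
    while (-not $csvReader.EndOfStream) {
        $row = [Ordered]@{}
        $splitLine = $csvSplitter.Split($csvTrimmer.Replace($csvReader.ReadLine(), ''))
        for ($columnNumber = 0; $columnNumber -lt $splitLine.Length; $columnNumber++) {
            $row[$columnNames[$columnNumber]] = $splitLine[$columnNumber]
        }
        # Emitting here hands the row straight to the next pipeline stage,
        # so only the current row has to be held in memory.
        [PSCustomObject]$row
    }
} | Export-Csv "$PSScriptRoot\big_out.csv" -NoTypeInformation

$csvReader.Close()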
