PowerShell to split large CSV. Taking very long. Hoping to shorten process via code review


The CSV is about a gig and has over 1 million rows. I'm attempting to split it into 5 files. The first 4 take about 18 hours, and the 5th never happens (or I haven't been patient enough); my PC will just sit in sleep mode because it won't unlock.

function Split-Csv{
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory)]
        [string]$FilePath,

        [Parameter()]
        [string]$Delimiter=',',

        [Parameter(Mandatory)]
        [string]$TargetFolder,

        [Parameter(Mandatory)]
        [string]$Name,

        [Parameter(Mandatory)]
        [int]$NumberOfFiles
    )

    try{
        if(-not (Test-Path $TargetFolder)){
            New-Item -Path $TargetFolder -ItemType Directory
        }

        # load the whole CSV into memory as objects
        $csvData=Import-Csv -Path $FilePath -Delimiter $Delimiter

        $startRow=0
        $numberOfRowsPerFile=[Math]::Ceiling($csvData.count/$NumberOfFiles)
        $counter=1

        # re-pipe the full array for every chunk, skipping the rows already written
        while($startRow -lt $csvData.Count){
            $csvData | Select-Object -Skip $startRow -First $numberOfRowsPerFile | Export-Csv -Path "$TargetFolder\$Name-$counter.csv" -NoTypeInformation -NoClobber
            $startRow += $numberOfRowsPerFile
            $counter++
        }
    }catch{
        Write-Error $_.Exception.Message
    }
}
Split-Csv -FilePath "C:\temp\2018.csv" -Delimiter ',' -TargetFolder "C:\temp\Split" -Name "Split" -NumberOfFiles 5

CodePudding user response:

You can give this function a try. As I stated in the comments, this will be much faster and more memory-friendly if you use the .NET APIs instead of treating the CSV as objects.

This function uses the StreamReader and StreamWriter classes to read and write line by line.

function Split-Csv {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [alias('FullName')]
        [string] $Path,

        [parameter(Mandatory)]
        [string] $DestinationFolder,

        [parameter()]
        [Int] $Chunks = 3
    )

    process {
        try {
            [IO.FileInfo] $Path = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($Path)
            $Destination = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($DestinationFolder)
            if(-not (Test-Path $Destination)) {
                $null = New-Item $Destination -ItemType Directory
            }
            $reader  = [IO.StreamReader] $Path.FullName
            $headers = $reader.ReadLine()
            $Index   = 0

            $newWriter = {
                $name     = $Path.BaseName + " - Part $Index" + $Path.Extension
                $newChunk = Join-Path $Destination -ChildPath $name
                $writer   = [IO.StreamWriter] $newChunk
                $writer.AutoFlush = $true
                $writer.WriteLine($headers)
                $writer
            }

            # chunk size is derived from the file's byte length, so the split is by size rather than by row count
            $ChunkSize = [math]::Ceiling($Path.Length / $Chunks)
            $writer    = & $newWriter

            while(-not $reader.EndOfStream) {
                if($writer.BaseStream.Length -ge $ChunkSize) {
                    $Index++
                    $writer.Dispose()
                    $writer = & $newWriter
                }
                $writer.WriteLine($reader.ReadLine())
            }
        }
        catch {
            $PSCmdlet.ThrowTerminatingError($_)
        }
        finally {
            $writer, $reader | ForEach-Object Dispose
        }
    }
}

So, for example, assuming you have a 1 GB CSV and you want to split it into 5 chunks, you can do the following:

Get-Item path\to\myBigCsv.csv | Split-Csv -DestinationFolder .\chunksfolder -Chunks 5
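
Since -Path also binds by name, you don't have to pipe from Get-Item; a minimal usage sketch (reusing the paths from the question) would be:

Split-Csv -Path C:\temp\2018.csv -DestinationFolder C:\temp\Split -Chunks 5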

CodePudding user response:

I tested this code on various sizes of CSVs:

$incsv = Import-Csv .\BigCsv.csv

# rows per chunk
$incr = [int]($incsv.Count / 5)
$last = $incsv.Count - 1

$idx = @()

# build a start/end row range for each chunk
for ($i = 0; $i -le $last; $i += $incr)
{
    $end = $i + ($incr - 1)
    if ($end -gt $last) { $end = $last }
    $idx += @{ 'Start' = $i ; 'End' = $end }
}

# slice the in-memory array and export each chunk
for ($i = 0; $i -lt 5; $i++)
{
    $start = $idx[$i].Start
    $end = $idx[$i].End
    $incsv[$start..$end] | Export-Csv -NoType ".\Chunk$($i + 1).csv"
}

I timed Import-Csv and each chunk export. Import-Csv/Export-Csv don't scale very well, but they should be OK for the size/rows you're working with. My final test CSV was 1.68 GB and ~1.1 million rows, on a 7200 RPM drive in a 5-year-old laptop; this should run much faster on better hardware:

Import-Csv : 00:05:51.2411580
Chunk 1    : 00:02:12.3754368
Chunk 2    : 00:01:16.5562393
Chunk 3    : 00:01:13.3434148
Chunk 4    : 00:01:14.5231756
Chunk 5    : 00:01:14.6382049

Total time : 00:13:02.6859795

The extra time for chunk 1 might be a disk-related anomaly; in all the previous tests, the chunk times were near-identical, and the combined time for the exports was roughly the same as the import time.
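
If you want to reproduce the timings, here is a minimal sketch of one way to capture them; this is an assumption about the method (the answer doesn't show the actual timing harness), and it reuses the $idx ranges built in the code above:

# hypothetical timing harness; a TimeSpan formats as 00:05:51.2411580 by default
$t0 = Get-Date
$incsv = Import-Csv .\BigCsv.csv
'Import-Csv : {0}' -f ((Get-Date) - $t0)

for ($i = 0; $i -lt 5; $i++)
{
    $t0 = Get-Date
    $incsv[$idx[$i].Start..$idx[$i].End] | Export-Csv -NoType ".\Chunk$($i + 1).csv"
    'Chunk {0}    : {1}' -f ($i + 1), ((Get-Date) - $t0)
}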
