Home > Software design >  What is the good way to read data from CSV and converting them to JSON?
What is the good way to read data from CSV and converting them to JSON?

Time:11-30

I am trying to read the data from CSV file which has 2200000 records using PowerShell and storing each record in JSON file, but this takes almost 12 hours.

Sample CSV Data:

We will only concern about the 1st column value's.

enter image description here

Code:

function Read-IPData
{
    $dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
    Write-Host "file path - $dbFilePath"

    Write-Host "Reading..."
    $data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
    Write-Host "Reading data finished"

    $count = $data.Count
    Write-host "Total $count records found"

    return $data
}

function Convert-NumbetToIP
{
    param(
    [Parameter(Mandatory=$true)][string]$number
    )

    try
    {    
        $w = [int64]($number/16777216)%6
        $x = [int64]($number/65536)%6
        $y = [int64]($number/256)%6
        $z = [int64]$number%6

        $ipAddress = "$w.$x.$y.$z"
        Write-Host "IP Address - $ipAddress"

        return $ipAddress
    }
    catch
    {
        Write-Host "$_"
        continue
    }
}

Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData

Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"

if(!(Test-Path $outputFile))
{
    Write-Host "$outputFile doestnot exist, creating..."
    New-Item -Path $outputFile -type "file"
}

foreach($item in $data)
{
    $row = $item -split ","
    $ipNumber = $row[0].trim('"')

    Write-Host "Converting $ipNumber to ipaddress"
    $toIpAddress = Convert-NumbetToIP -number $ipNumber

    Write-Host "Preparing document JSON"
    $object = [PSCustomObject]@{
        "ip-address" = $toIpAddress
        "is-vpn" = "true"
        "@timestamp" = (Get-Date).ToString("o")
    }

    $document = $object | ConvertTo-Json -Compress -Depth 100
    Write-Host "Adding document - $document"
    Add-Content -Path $outputFile $document
}

Could you please help optimize the code or is there a better way to do it. or is there a way like multi-threading.

CodePudding user response:

You can optimize the function Convert-NumberToIP like below:

function Convert-NumberToIP {
    param(
        [Parameter(Mandatory=$true)][uint32]$number
    )

    # either do the math yourself like this:

    # $w = ($number -shr 24) -band 255
    # $x = ($number -shr 16) -band 255
    # $y = ($number -shr 8) -band 255
    # $z = $number -band 255
    # '{0}.{1}.{2}.{3}' -f $w, $x, $y, $z  # output the dotted IP string

    # or use .Net:
    $n = ([IPAddress]$number).GetAddressBytes()
    [array]::Reverse($n)
    ([IPAddress]$n).IPAddressToString
}

CodePudding user response:

Here is a possible optimization:


function Get-IPDataPath
{
    $dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object FullName | Select-Object -First 1
    Write-Host "file path - $dbFilePath"
    $dbFilePath  # implicit output
}

function Convert-NumberToIP
{
    param(
        [Parameter(Mandatory=$true)][string]$number
    )

    [Int64] $numberInt = 0
    if( [Int64]::TryParse( $number, [ref] $numberInt ) ) {
        if( ($numberInt -ge 0) -and ($numberInt -le 0xFFFFFFFFl) ) {
            $ipBytes = ([IPAddress] $numberInt).GetAddressBytes()
            $ipBytes -join '.'  # implicit output
        }
    }
    # In case TryParse() returns $false or the number is out of range for an IPv4 address, 
    # the output of this function will be empty, which converts to $false in a boolean context.
}

$dbFilePath = Get-IPDataPath
Write-Host "Getting IP Addresses from $dbFilePath"

Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"

if(!(Test-Path $outputFile))
{
    Write-Host "$outputFile does not exist, creating..."
    $null = New-Item -Path $outputFile -type "file"
}

$object = [PSCustomObject]@{
    'ip-address' = ''
    'is-vpn' = 'true'
    '@timestamp' = ''
}

# Enclose foreach loop in a script block to be able to pipe its output to Set-Content
& {
    foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
    {
        $row = $item -split ','
        $ipNumber = $row[0].trim('"')

        if( $ip = Convert-NumberToIP -number $ipNumber ) 
        {
            $object.'ip-address' = $ip
            $object.'@timestamp' = (Get-Date).ToString('o')

            # Implicit output
            $object | ConvertTo-Json -Compress -Depth 100
        }

    }
} | Set-Content -Path $outputFile

Remarks:

  • Avoid Get-Content, especially for line-by-line processing it tends to be slow. A much faster alternative is the File.ReadLines method. To skip the header line, use the Linq.Enumerable.Skip() method.
  • There is no need to read the whole CSV into memory first. Using ReadLines in a foreach loop does lazy enumeration, i. e. it reads only one line per loop iteration. This works because it returns an enumerator instead of a collection of lines.
  • Avoid try and catch if exceptions occur often, because the "exceptional" code path is very slow. Instead use Int64.TryParse() which returns a boolean indicating successful conversion.
  • Instead of "manually" converting the IP number to bytes, use the IPAddress class which has a constructor that takes an integer number. Use its method .GetAddressBytes() to get an array of bytes in network (big-endian) order. Finally use the PowerShell -join operator to create a string of the expected format.
  • Don't allocate a [pscustomobject] for each row, which has some overhead. Create it once before the loop and inside the loop only assign the values.
  • Avoid Write-Host (or any output to the console) within inner loops.
  • Note that the output isn't a valid JSON file, but rather a concatenation of multiple JSON files (one per CSV row). If you actually want this to be one big, valid JSON file, enclose the output in [ ] and insert a comma , between each row.

Modified processing loop to write a single, valid JSON file instead of a concatenation of JSON files:

& {
    '[' # begin array
    $first = $true

    foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
    {
        $row = $item -split ','
        $ipNumber = $row[0].trim('"')

        if( $ip = Convert-NumberToIP -number $ipNumber ) 
        {
            $object.'ip-address' = $ip
            $object.'@timestamp' = (Get-Date).ToString('o')
        
            $row = $object | ConvertTo-Json -Compress -Depth 100

            # write array element delimiter if necessary
            if( $first ) { $row; $first = $false } else { ",$row" } 
        }

    }
    ']'  # end array
} | Set-Content -Path $outputFile
  • Related