I am trying to read the data from CSV file which has 2200000 records using PowerShell and storing each record in JSON file, but this takes almost 12 hours.
Sample CSV Data:
We will only concern about the 1st column value's.
Code:
function Read-IPData
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
Write-Host "file path - $dbFilePath"
Write-Host "Reading..."
$data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
Write-Host "Reading data finished"
$count = $data.Count
Write-host "Total $count records found"
return $data
}
function Convert-NumbetToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
try
{
$w = [int64]($number/16777216)%6
$x = [int64]($number/65536)%6
$y = [int64]($number/256)%6
$z = [int64]$number%6
$ipAddress = "$w.$x.$y.$z"
Write-Host "IP Address - $ipAddress"
return $ipAddress
}
catch
{
Write-Host "$_"
continue
}
}
Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData
Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
if(!(Test-Path $outputFile))
{
Write-Host "$outputFile doestnot exist, creating..."
New-Item -Path $outputFile -type "file"
}
foreach($item in $data)
{
$row = $item -split ","
$ipNumber = $row[0].trim('"')
Write-Host "Converting $ipNumber to ipaddress"
$toIpAddress = Convert-NumbetToIP -number $ipNumber
Write-Host "Preparing document JSON"
$object = [PSCustomObject]@{
"ip-address" = $toIpAddress
"is-vpn" = "true"
"@timestamp" = (Get-Date).ToString("o")
}
$document = $object | ConvertTo-Json -Compress -Depth 100
Write-Host "Adding document - $document"
Add-Content -Path $outputFile $document
}
Could you please help optimize the code or is there a better way to do it. or is there a way like multi-threading.
CodePudding user response:
You can optimize the function Convert-NumberToIP like below:
function Convert-NumberToIP {
param(
[Parameter(Mandatory=$true)][uint32]$number
)
# either do the math yourself like this:
# $w = ($number -shr 24) -band 255
# $x = ($number -shr 16) -band 255
# $y = ($number -shr 8) -band 255
# $z = $number -band 255
# '{0}.{1}.{2}.{3}' -f $w, $x, $y, $z # output the dotted IP string
# or use .Net:
$n = ([IPAddress]$number).GetAddressBytes()
[array]::Reverse($n)
([IPAddress]$n).IPAddressToString
}
CodePudding user response:
Here is a possible optimization:
function Get-IPDataPath
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object FullName | Select-Object -First 1
Write-Host "file path - $dbFilePath"
$dbFilePath # implicit output
}
function Convert-NumberToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
[Int64] $numberInt = 0
if( [Int64]::TryParse( $number, [ref] $numberInt ) ) {
if( ($numberInt -ge 0) -and ($numberInt -le 0xFFFFFFFFl) ) {
$ipBytes = ([IPAddress] $numberInt).GetAddressBytes()
$ipBytes -join '.' # implicit output
}
}
# In case TryParse() returns $false or the number is out of range for an IPv4 address,
# the output of this function will be empty, which converts to $false in a boolean context.
}
$dbFilePath = Get-IPDataPath
Write-Host "Getting IP Addresses from $dbFilePath"
Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
if(!(Test-Path $outputFile))
{
Write-Host "$outputFile does not exist, creating..."
$null = New-Item -Path $outputFile -type "file"
}
$object = [PSCustomObject]@{
'ip-address' = ''
'is-vpn' = 'true'
'@timestamp' = ''
}
# Enclose foreach loop in a script block to be able to pipe its output to Set-Content
& {
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
# Implicit output
$object | ConvertTo-Json -Compress -Depth 100
}
}
} | Set-Content -Path $outputFile
Remarks:
- Avoid
Get-Content
, especially for line-by-line processing it tends to be slow. A much faster alternative is theFile.ReadLines
method. To skip the header line, use theLinq.Enumerable.Skip()
method. - There is no need to read the whole CSV into memory first. Using
ReadLines
in aforeach
loop does lazy enumeration, i. e. it reads only one line per loop iteration. This works because it returns an enumerator instead of a collection of lines. - Avoid
try
andcatch
if exceptions occur often, because the "exceptional" code path is very slow. Instead useInt64.TryParse()
which returns aboolean
indicating successful conversion. - Instead of "manually" converting the IP number to bytes, use the
IPAddress
class which has a constructor that takes an integer number. Use its method.GetAddressBytes()
to get an array of bytes in network (big-endian) order. Finally use the PowerShell-join
operator to create a string of the expected format. - Don't allocate a
[pscustomobject]
for each row, which has some overhead. Create it once before the loop and inside the loop only assign the values. - Avoid
Write-Host
(or any output to the console) within inner loops. - Note that the output isn't a valid JSON file, but rather a concatenation of multiple JSON files (one per CSV row). If you actually want this to be one big, valid JSON file, enclose the output in
[ ]
and insert a comma,
between each row.
Modified processing loop to write a single, valid JSON file instead of a concatenation of JSON files:
& {
'[' # begin array
$first = $true
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
$row = $object | ConvertTo-Json -Compress -Depth 100
# write array element delimiter if necessary
if( $first ) { $row; $first = $false } else { ",$row" }
}
}
']' # end array
} | Set-Content -Path $outputFile