Home > Blockchain >  How to speed up processing of ~million lines of text in log file
How to speed up processing of ~million lines of text in log file

Time:11-05

I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space as both a delineator as well as for empty fields. I couldn't get around this in LogParser. Fortunately Powershell affords me the option to reference fields from the end as well making it easier to get later fields affected by shift.

After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).

The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).

I'm not doing much in the code itself. I'm doing the following:

  • Splitting line via space as delimiter
  • One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
  • Saving the needed array elements into my pscustomobject
  • Adding pscustomobject to arraylist to be used later.
  • During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.

All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.

Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?

Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A583FE0 UDP Snd 127.0.0.1      6c94 R Q [8385 A DR NXDOMAIN] AAAA   (4)pool(3)ntp(3)org(0)

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A582050 UDP Snd 127.0.0.1    3d9d R Q [8081   DR  NOERROR] A      (4)pool(3)ntp(3)org(0)

NOTE:  the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081   DR  NOERROR] (3 fields)

Other examples would be the "R Q" where sometimes it's "  Q".

$Logfile = "C:\Temp\log.txt"

[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()

# Initialize log iteration counter
$i = 1

# Get Log data.  Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d \/"}}
$LogDataTotalLines = $LogData.Length

# Process each log entry
$LogData | ForEach-Object {
        
        $PercentComplete =  [math]::Round(($i/$LogDataTotalLines * 100))
        Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
        
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s '
        
        # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
        If ($DNSCache.IP -notcontains $temp[8]) {
            $DNSEntry = [PSCustomObject]@{
                IP = $temp[8]
                DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
            }
            
            # Add DNSEntry to DNSCache collection
            $DNSCache.Add($DNSEntry) | Out-Null
            
            # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE:  value could be blank.
            $ResolvedDNSName = $DNSEntry.DNSName
        } Else {
            # DNSCache contains resolved IP already.  Find and Use it.
            $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
        }

        $LogEntry = [PSCustomObject]@{
            Datetime = $temp[0]   " "   $temp[1]   " "   $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP = $temp[8]
            ClientDNSName = $ResolvedDNSName
            QueryType = $temp[-2]       # Second to last entry of array
            QueryName = ($temp[-1] -Replace "\(\d \)",".") -Replace "^\.",""        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }
        
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry) | Out-Null
       $i  
    }

CodePudding user response:

Here is a more optimized version you can try.

What changed?:

  • Removed Write-Progress, especially because it's not known if Windows PowerShell is used. PowerShell versions below 6 have a big performance impact with Write-Progress
  • Changed $DNSCache to Generic Dictionary for fast lookups
  • Changed $LogEntries to Generic List
  • Switched from Get-Content to switch -Regex -File
$Logfile = 'C:\Temp\log.txt'

$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache   = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)

# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d \/' {
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s '
        $ip   = [string] $temp[8]

        $resolvedDNSRecord = $DNSCache[$ip]

        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP      = $ip 
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }

        $LogEntry = [PSCustomObject]@{
            Datetime      = $temp[0]   ' '   $temp[1]   ' '   $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP      = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType     = $temp[-2]       # Second to last entry of array
            QueryName     = ($temp[-1] -Replace '\(\d \)', '.') -Replace '^\.', ''        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }

        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}

If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).

  • Related