I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space as both a delineator as well as for empty fields. I couldn't get around this in LogParser. Fortunately Powershell affords me the option to reference fields from the end as well making it easier to get later fields affected by shift.
After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).
The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).
I'm not doing much in the code itself. I'm doing the following:
- Splitting line via space as delimiter
- One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
- Saving the needed array elements into my pscustomobject
- Adding pscustomobject to arraylist to be used later.
- During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.
All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.
Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?
Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)
NOTE: the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081 DR NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's " Q".
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()
# Initialize log iteration counter
$i = 1
# Get Log data. Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d \/"}}
$LogDataTotalLines = $LogData.Length
# Process each log entry
$LogData | ForEach-Object {
$PercentComplete = [math]::Round(($i/$LogDataTotalLines * 100))
Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
# Split line using space, including sequential spaces, as delimiter.
# NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
# are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
$temp = $_ -Split '\s '
# Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
If ($DNSCache.IP -notcontains $temp[8]) {
$DNSEntry = [PSCustomObject]@{
IP = $temp[8]
DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
}
# Add DNSEntry to DNSCache collection
$DNSCache.Add($DNSEntry) | Out-Null
# Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE: value could be blank.
$ResolvedDNSName = $DNSEntry.DNSName
} Else {
# DNSCache contains resolved IP already. Find and Use it.
$ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
}
$LogEntry = [PSCustomObject]@{
Datetime = $temp[0] " " $temp[1] " " $temp[2] # Combines first 3 fields Date, Time, AM/PM
ClientIP = $temp[8]
ClientDNSName = $ResolvedDNSName
QueryType = $temp[-2] # Second to last entry of array
QueryName = ($temp[-1] -Replace "\(\d \)",".") -Replace "^\.","" # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
}
# Add LogEntry to LogEntries collection
$LogEntries.Add($LogEntry) | Out-Null
$i
}
CodePudding user response:
Here is a more optimized version you can try.
What changed?:
- Removed
Write-Progress
, especially because it's not known if Windows PowerShell is used. PowerShell versions below 6 have a big performance impact withWrite-Progress
- Changed
$DNSCache
to Generic Dictionary for fast lookups - Changed
$LogEntries
to Generic List - Switched from
Get-Content
toswitch -Regex -File
$Logfile = 'C:\Temp\log.txt'
$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)
# Process each log entry
switch -Regex -File ($Logfile) {
'^\d \/' {
# Split line using space, including sequential spaces, as delimiter.
# NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
# are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
$temp = $_ -Split '\s '
$ip = [string] $temp[8]
$resolvedDNSRecord = $DNSCache[$ip]
if ($null -eq $resolvedDNSRecord) {
$resolvedDNSRecord = [PSCustomObject]@{
IP = $ip
DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
}
$DNSCache[$ip] = $resolvedDNSRecord
}
$LogEntry = [PSCustomObject]@{
Datetime = $temp[0] ' ' $temp[1] ' ' $temp[2] # Combines first 3 fields Date, Time, AM/PM
ClientIP = $ip
ClientDNSName = $resolvedDNSRecord.DNSName
QueryType = $temp[-2] # Second to last entry of array
QueryName = ($temp[-1] -Replace '\(\d \)', '.') -Replace '^\.', '' # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
}
# Add LogEntry to LogEntries collection
$LogEntries.Add($LogEntry)
}
}
If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).