I have a text file containing a large number of log messages. I want to extract the messages between two string patterns, and I want the extracted text to appear exactly as it does in the file.
I tried the following approach. It works, but it doesn't support Get-Content's -Wait and -Tail options. Also, the extracted result is displayed on a single line rather than with the line breaks of the original file. Inputs are welcome :-)
Sample Code
function GetTextBetweenTwoStrings($startPattern, $endPattern, $filePath){
    # Get content from the input file
    $fileContent = Get-Content $filePath
    # Regular expression (Regex) of the given start and end patterns
    $pattern = "$startPattern(.*?)$endPattern"
    # Perform the Regex operation
    $result = [regex]::Match($fileContent,$pattern).Value
    # Finally return the result to the caller
    return $result
}
# Clear the screen
Clear-Host
$input = "THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'
# Call the function
GetTextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $input
Improved script based on Theo's answer. The following points still need to be addressed:
- The beginning and end of the output are somehow trimmed, even though I adjusted the buffer size in the script.
- How can I wrap each matched result in the START and END strings?
- I still could not figure out how to use the -Wait and -Tail options.
Updated Script
# Clear the screen
Clear-Host
# Adjust the buffer size of the window
$bw = 10000
$bh = 300000
if ($host.name -eq 'ConsoleHost') # or -notmatch 'ISE'
{
    [console]::bufferwidth = $bw
    [console]::bufferheight = $bh
}
else
{
    $pshost = get-host
    $pswindow = $pshost.ui.rawui
    $newsize = $pswindow.buffersize
    $newsize.height = $bh
    $newsize.width = $bw
    $pswindow.buffersize = $newsize
}
function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}
# Input file path
$inputFile = "THE-LOG-FILE.log"
# The patterns
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'
Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
CodePudding user response:
First of all, you should not use $input as a self-defined variable name, because it is an automatic variable.
Then, you are reading the file as a string array, whereas you should read it as a single, multiline string. For that, append the -Raw switch to the Get-Content call.
The regex you are creating does not allow for regex special characters in the start and end patterns you give, so I would suggest using [regex]::Escape() on these patterns when creating the regex string.
While your regex does use a capturing group (the part inside the parentheses), you are not using that group when it comes to getting the value you seek.
Finally, I would recommend using the PowerShell naming convention (Verb-Noun) for the function name.
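To illustrate the escaping point, here is a minimal sketch. The literal 'START (v1.0)' is a made-up marker chosen only because it contains regex metacharacters; it is not taken from the log file in the question.

```powershell
# Hypothetical start marker containing regex metacharacters: ( ) and .
$literal = 'START (v1.0)'

# Unescaped, the parentheses form a group and the dot matches any character,
# so the pattern matches text that was never intended ...
$looseHit = 'START v1x0' -match $literal

# ... and fails on the literal marker itself, because the regex
# does not expect the parentheses to appear in the text.
$literalMiss = 'START (v1.0)' -match $literal

# Escaped, the marker is matched character for character.
$escaped = [regex]::Escape($literal)
$exactHit = 'START (v1.0)' -match $escaped
$exactMiss = 'START v1x0' -match $escaped

"$looseHit $literalMiss $exactHit $exactMiss"
```

If the patterns are meant to be regexes rather than literal text, the escaping step should of course be left out.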
Try
function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}
$inputFile = "D:\Test\THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'
Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
Would result in something like:
blahblah
more lines here
The (?is) makes the regex case-insensitive and makes the dot match line breaks as well.
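As a small illustration of what the (?is) prefix changes, the sketch below runs the same pattern with and without it against an in-memory string. The sample text is invented for the demonstration and deliberately uses a lower-case start marker.

```powershell
# Invented sample: lower-case start marker, two lines of content, upper-case end marker
$text = "start-of-pattern`nline one`nline two`nEND-OF-PATTERN"

# Without inline options, [regex]::Match is case-sensitive and '.' stops at line breaks
$plain = [regex]::Match($text, 'START-OF-PATTERN(.*?)END-OF-PATTERN')

# (?i) ignores case, (?s) lets '.' match line breaks too
$flagged = [regex]::Match($text, '(?is)START-OF-PATTERN(.*?)END-OF-PATTERN')

$plain.Success             # False
$flagged.Groups[1].Value   # the two lines between the markers, line breaks included
```

Note that, unlike the -match operator, the [regex] class is case-sensitive unless told otherwise, which is why the (?i) part matters here.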
CodePudding user response:
You need to perform streaming processing of your Get-Content call, in a pipeline, such as with ForEach-Object, if you want to process lines as they're being read.
- This is a must if you're using Get-Content -Wait, as this call doesn't terminate by itself (it keeps waiting for new lines indefinitely), but inside a pipeline its output can be processed as it is being received.
You're trying to match across multiple lines, which with Get-Content output would only work if you used the -Raw switch; by default, Get-Content reads its input file line by line.
- However, -Raw is incompatible with -Wait.
- Therefore, you need to match the start and end patterns separately, and keep track of when you're processing lines between those two patterns.
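The difference between line-by-line reading and -Raw can be seen with a throwaway file. This is only a sketch: the temporary file and its three lines are made up for the demonstration.

```powershell
# Create a temporary three-line file (contents invented for this demo)
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'START', 'middle', 'END'

$lines = Get-Content -Path $tmp        # one string per line
$raw   = Get-Content -Path $tmp -Raw   # one multi-line string

# A regex that has to span lines never matches any individual line ...
$perLineHits = @($lines | Where-Object { $_ -match '(?s)START.*END' })

# ... but it does match the single multi-line string
$rawHit = $raw -match '(?s)START.*END'

"$($lines.Count) lines, $($perLineHits.Count) per-line hits, raw hit: $rawHit"

Remove-Item -Path $tmp
```

Since -Raw cannot be combined with -Wait, the streaming approach has to reconstruct the block from individual lines instead.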
Here's a proof of concept, but note the following:
- -Tail 100 is hard-coded; adjust as needed or make it another parameter.
- The use of -Wait means that the function will run indefinitely, waiting for new lines to be added to $filePath, so you'll need to use Ctrl-C to stop it.
- While you can use a Get-TextBetweenTwoStrings call itself in a pipeline for object-by-object processing, assigning its result to a variable ($result = ...) won't work when terminating with Ctrl-C.
- To work around this limitation, the function below is defined as an advanced function, which automatically enables support for the common -OutVariable parameter, which is populated even in the event of termination with Ctrl-C; your sample call would then look as follows (as Theo notes, don't use the automatic $input variable as a custom variable):
# Look for blocks of interest in the input file, indefinitely,
# and output them as they're being found.
# After termination with Ctrl-C, $result will also contain the blocks
# found, if any.
Get-TextBetweenTwoStrings -OutVariable result -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
- The word pattern in your $startPattern and $endPattern parameters suggests that they are regexes that can therefore be used as-is or embedded as-is in a larger regex with the -match operator, as shown below.
- If you want them to be treated as literal strings, escape them with [regex]::Escape(), e.g.: $_ -match ('.*?' + [regex]::Escape($endPattern))
- Separately, if you want the block of lines to encompass the full lines on which the start and end patterns match (assuming the patterns themselves aren't anchored, such as with ^ and $), prepend .*? / append .* in the regexes below.
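As a small illustration of the last point, the sketch below compares what $Matches[0] contains with and without the extra .*? / .* parts. The sample lines and the START / END markers are invented for the demonstration.

```powershell
# Invented sample lines with text before the start marker and after the end marker
$startLine = 'prefix START first'
$endLine   = 'last END suffix'

# Regexes as used in the proof of concept: text before START / after END is cut off
$null = $startLine -match 'START.*'
$startTrimmed = $Matches[0]    # 'START first'
$null = $endLine -match '.*?END'
$endTrimmed = $Matches[0]      # 'last END'

# With '.*?' prepended / '.*' appended, the full lines are captured
$null = $startLine -match '.*?START.*'
$startFull = $Matches[0]       # 'prefix START first'
$null = $endLine -match '.*?END.*'
$endFull = $Matches[0]         # 'last END suffix'

$startTrimmed, $endTrimmed, $startFull, $endFull
```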
# Note the use of "-" after "Get", to adhere to PowerShell's
# "<Verb>-<Noun>" naming convention.
function Get-TextBetweenTwoStrings {
    # Make the function an advanced one, so that it supports the
    # -OutVariable common parameter.
    [CmdletBinding()]
    param(
        $startPattern,
        $endPattern,
        $filePath
    )
    $inBlock = $false
    $block = [System.Collections.Generic.List[string]]::new()
    Get-Content -Tail 100 -Wait $filePath | ForEach-Object {
        if ($inBlock) {
            if ($_ -match ".*?$endPattern") {
                $block.Add($Matches[0])
                # Output the block of lines as a single, multi-line string
                $block -join "`n"
                # Leave the block, so that lines up to the next start pattern are ignored
                $inBlock = $false
                $block.Clear()
            }
            else {
                $block.Add($_)
            }
        }
        elseif ($_ -match "$startPattern.*") {
            $inBlock = $true
            $block.Add($Matches[0])
        }
    }
}
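Because the -Wait version never returns on its own, it can be handy to try the block-tracking logic on an in-memory array first. The sketch below uses a hypothetical helper, Select-BlockFromLines (not part of the answer above), that applies the same start/end bookkeeping to a string array; it assumes the flag should be reset once a block has been emitted and that start and end markers sit on different lines.

```powershell
# Hypothetical helper for experimenting without -Wait: same bookkeeping, array input
function Select-BlockFromLines {
    param(
        [string[]] $Lines,
        [string] $StartPattern,   # treated as a regex, like in the function above
        [string] $EndPattern      # treated as a regex, like in the function above
    )
    $inBlock = $false
    $block = [System.Collections.Generic.List[string]]::new()
    foreach ($line in $Lines) {
        if ($inBlock) {
            if ($line -match ".*?$EndPattern") {
                $block.Add($Matches[0])
                # Emit the collected lines as one multi-line string
                $block -join "`n"
                # Assumption: lines after the end marker belong to no block
                $inBlock = $false
                $block.Clear()
            }
            else {
                $block.Add($line)
            }
        }
        elseif ($line -match "$StartPattern.*") {
            $inBlock = $true
            $block.Add($Matches[0])
        }
    }
}

# Invented sample input containing two blocks and some unrelated lines
$sample = 'noise', 'x START a', 'b', 'c END y', 'noise', 'START d', 'END'
$blocks = @(Select-BlockFromLines -Lines $sample -StartPattern 'START' -EndPattern 'END')
$blocks
```

Each element of $blocks is one multi-line string, so the line breaks of the source are preserved when the result is written to the console or to a file.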