PowerShell Extract text between two strings with -Tail and -Wait

Time:03-15

I have a text file with a large number of log messages. I want to extract the messages between two string patterns. I want the extracted message to appear as it is in the text file.

I tried the following approach. It works, but it doesn't support Get-Content's -Wait and -Tail options. Also, the extracted result is displayed on a single line instead of appearing as it does in the text file. Inputs are welcome :-)

Sample Code

function GetTextBetweenTwoStrings($startPattern, $endPattern, $filePath){

    # Get content from the input file
    $fileContent = Get-Content $filePath

    # Regular expression (Regex) of the given start and end patterns
    $pattern = "$startPattern(.*?)$endPattern"

    # Perform the Regex operation
    $result = [regex]::Match($fileContent,$pattern).Value

    # Finally return the result to the caller
    return $result
}

# Clear the screen
Clear-Host

$input = "THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'

# Call the function
GetTextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $input

Improved script based on Theo's answer. The following points need to be improved:

  1. The beginning and end of the output are somehow trimmed, even though I adjusted the buffer size in the script.
  2. How can I wrap each matched result in the START and END strings?
  3. I still could not figure out how to use the -Wait and -Tail options.

Updated Script

# Clear the screen
Clear-Host

# Adjust the buffer size of the window
$bw = 10000
$bh = 300000
if ($host.name -eq 'ConsoleHost') # or -notmatch 'ISE'
{
  [console]::bufferwidth = $bw
  [console]::bufferheight = $bh
}
else
{
    $pshost = get-host
    $pswindow = $pshost.ui.rawui
    $newsize = $pswindow.buffersize
    $newsize.height = $bh
    $newsize.width = $bw
    $pswindow.buffersize = $newsize
}


function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

# Input file path
$inputFile = "THE-LOG-FILE.log"

# The patterns
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'


Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile

CodePudding user response:

First of all, you should not use $input as a self-defined variable name, because it is an automatic variable.
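As a small illustration of why the name is reserved, here is a sketch with a hypothetical function, Show-PipelineInput, that is not part of the question: inside a function without a process block, PowerShell itself fills $input with the pipeline input.

```powershell
# Hypothetical demo function, only for illustration
function Show-PipelineInput {
  # $input is populated by PowerShell with whatever was piped in
  $input | ForEach-Object { "got: $_" }
}

$out = 1, 2, 3 | Show-PipelineInput
$out   # got: 1, got: 2, got: 3 (one string per line)
```

A name such as $inputFile avoids any clash with that behavior.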

Then, you are reading the file as a string array, whereas you should read it as a single, multiline string. To do that, append the -Raw switch to the Get-Content call.
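The difference can be seen with a throwaway file. This is a minimal sketch; the temp file name and its contents are made up for the illustration. The last statement also shows the likely reason the original output ended up on one line: an array passed where a string is expected is joined with spaces, assuming $OFS has not been changed.

```powershell
# Hypothetical sample file, created only for this illustration
$samplePath = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.log'
Set-Content -Path $samplePath -Value 'first line', 'second line', 'third line'

# Default: one string per line, collected into an array
$asLines = Get-Content -Path $samplePath
$asLines.Count          # 3

# -Raw: the whole file as one string, line breaks included
$asText = Get-Content -Path $samplePath -Raw
$asText -is [string]    # True

# Converting the array to a single string joins the elements with spaces
[string] $asLines       # first line second line third line

Remove-Item -Path $samplePath
```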

The regex you are creating does not allow for regex special characters in the start and end patterns you give, so I would suggest using [regex]::Escape() on these patterns when creating the regex string.

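While your regex does contain a capturing group (the parentheses), you are not using it when retrieving the value: .Value returns the entire match, including the start and end patterns, whereas .Groups[1].Value returns only the text between them.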

Finally, I would recommend using the PowerShell naming convention (Verb-Noun) for the function name.

Try

function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

$inputFile    = "D:\Test\THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern   = 'END-OF-PATTERN'

Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile

Would result in something like:

blahblah
more lines here

The (?is) prefix makes the regex case-insensitive (i) and lets the dot match line breaks as well (s).
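To see these pieces in isolation, here is a minimal sketch on a made-up sample string (no file involved). It also shows two things that go beyond the function above and are only suggestions: .Value keeps the start and end strings around the result, and [regex]::Matches() returns every block rather than just the first.

```powershell
# Made-up sample text with two blocks; the second uses lowercase markers on purpose
$text = "noise`nSTART-OF-PATTERN`nline 1`nline 2`nEND-OF-PATTERN`nnoise`nstart-of-pattern second end-of-pattern"

$pattern = '(?is){0}(.*?){1}' -f [regex]::Escape('START-OF-PATTERN'), [regex]::Escape('END-OF-PATTERN')

$first = [regex]::Match($text, $pattern)
$first.Value             # whole match, markers included
$first.Groups[1].Value   # only the text between the markers

# All blocks, with surrounding whitespace trimmed
$all = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Groups[1].Value.Trim() }
$all.Count               # 2
```

Because of the i option the lowercase markers are found too, and because of the s option the first block can span several lines.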

CodePudding user response:

  • You need to perform streaming processing of your Get-Content call, in a pipeline, such as with ForEach-Object, if you want to process lines as they're being read.

    • This is a must if you're using Get-Content -Wait, as this call doesn't terminate by itself (keeps waiting for new lines indefinitely), but inside a pipeline its output can be processed as it is being received.
  • You're trying to match across multiple lines, which with Get-Content output would only work if you used the -Raw switch - by default, Get-Content reads its input file line by line.

    • However, -Raw is incompatible with -Wait.
    • Therefore, you need to match the start and end patterns separately, and keep track of when you're processing lines between those two patterns.

Here's a proof of concept, but note the following:

  • -Tail 100 is hard-coded - adjust as needed or make it another parameter.

  • The use of -Wait means that the function will run indefinitely - waiting for new lines to be added to $filePath - so you'll need to use Ctrl-C to stop it.

    • While you can use a Get-TextBetweenTwoStrings call itself in a pipeline for object-by-object processing, assigning its result to a variable ($result = ...) won't work when terminating with Ctrl-C.

    • To work around this limitation, the function below is defined as an advanced function, which automatically enables support for the common -OutVariable parameter, which is populated even in the event of termination with Ctrl-C; your sample call would then look as follows (as Theo notes, don't use the automatic $input variable as a custom variable):

      # Look for blocks of interest in the input file, indefinitely,
      # and output them as they're being found.
      # After termination with Ctrl-C, $result will also contain the blocks
      # found, if any.
      Get-TextBetweenTwoStrings -OutVariable result -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
      
  • The word pattern in your $startPattern and $endPattern parameter names suggests that they are regexes, which can therefore be used as-is or embedded as-is in a larger regex with the -match operator, as shown below.

  • If you want them to be treated as literal strings, escape them with [regex]::Escape(), e.g.:

    • $_ -match ('.*?' + [regex]::Escape($endPattern))
  • Separately, if you want the block of lines to encompass the full lines on which the start and end patterns match (assuming the patterns themselves aren't anchored, such as with ^ and $), prepend .*? / append .* in the regexes below.
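The effect of escaping can be checked with a hypothetical marker that contains regex metacharacters; '[END]' below is made up for this sketch and is not from the question.

```powershell
$endPattern = '[END]'   # hypothetical marker containing regex metacharacters

# Used as-is, '[END]' is a character class that matches a single E, N or D
'DONE' -match $endPattern                     # True, although '[END]' is not in the string

# Escaped, only the literal text matches
'DONE' -match [regex]::Escape($endPattern)    # False
'work [END] tail' -match ('.*?' + [regex]::Escape($endPattern))   # True
$Matches[0]                                   # work [END]
```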

# Note the use of "-" after "Get", to adhere to PowerShell's
# "<Verb>-<Noun>" naming convention.
function Get-TextBetweenTwoStrings {

  # Make the function an advanced one, so that it supports the 
  # -OutVariable common parameter.
  [CmdletBinding()]
  param(
    $startPattern, 
    $endPattern, 
    $filePath
  )

  $inBlock = $false
  $block = [System.Collections.Generic.List[string]]::new()

  Get-Content -Tail 100 -Wait $filePath | ForEach-Object {
    if ($inBlock) {
      if ($_ -match ".*?$endPattern") {
        $block.Add($Matches[0])
        # Output the block of lines as a single, multi-line string
        $block -join "`n"
        # Leave the block so that lines up to the next start pattern are skipped
        $inBlock = $false
        $block.Clear()
      }
      else {
        $block.Add($_)
      }
    }
    elseif ($_ -match "$startPattern.*") {
      $inBlock = $true
      $block.Add($Matches[0])
    }
  }

}
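
Because -Wait never returns, the line-by-line logic is easier to try out on input that ends. The following is a sketch of the same state machine as a pipeline function; the name Select-TextBlock, its parameters and the sample lines are assumptions made for this illustration, not part of the question.

```powershell
# Hypothetical helper: same start/end tracking, but fed from the pipeline
function Select-TextBlock {
  [CmdletBinding()]
  param(
    [Parameter(Mandatory)] [string] $StartPattern,
    [Parameter(Mandatory)] [string] $EndPattern,
    [Parameter(Mandatory, ValueFromPipeline)] [AllowEmptyString()] [string] $Line
  )
  begin {
    $inBlock = $false
    $block = [System.Collections.Generic.List[string]]::new()
  }
  process {
    if ($inBlock) {
      if ($Line -match ".*?$EndPattern") {
        $block.Add($Matches[0])
        $block -join "`n"     # emit the finished block as one multi-line string
        $inBlock = $false     # wait for the next start pattern
        $block.Clear()
      }
      else {
        $block.Add($Line)
      }
    }
    elseif ($Line -match "$StartPattern.*") {
      $inBlock = $true
      $block.Add($Matches[0])
    }
  }
}

# Made-up sample lines containing two blocks
$lines = 'noise', 'x START a', 'b', 'c END y', 'noise', 'START d', 'END'
$blocks = $lines | Select-TextBlock -StartPattern 'START' -EndPattern 'END'
$blocks.Count   # 2
```

Since the function reads from the pipeline, the same call should also work behind Get-Content -Tail 100 -Wait $filePath, in which case it runs until stopped with Ctrl-C. As in the function above, a start and end pattern on the same line are not handled.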