Powershell regex failing with unexpected results

Time:06-25

I'm trying to extract a block of lines whenever a certain word is present anywhere in that block.

For example, I have the following text:

A1: blah blah
B1: blah blah foobar

A2: foobar blah blah
B2: blah blah foobar
C2: blah blah
D2: blah blah

A3: blah blah
B3: blah blah
C3: blah blah

The blocks can contain any number of lines and are separated by an empty line. The word I'm looking for, foobar, can appear anywhere in a block; that is the only constant. The line prefixes (A1, B1, etc.) are only there for simplicity; in the real data they vary completely.

This is the first regex I came up with. Obviously it does not adapt to the variable number of lines in a block, but at least it worked as it's supposed to.

.*[\r\n]+.*(foobar).*[\r\n]+(.*[\n\r]){1}

Result:
A1: blah blah
B1: blah blah foobar

A2: foobar blah blah
B2: blah blah foobar
C2: blah blah

I was able to further refine the regex and came up with the following:

(.\n?)*(foobar).*(\n?.)*

Result:
A1: blah blah
B1: blah blah foobar

A2: foobar blah blah
B2: blah blah foobar
C2: blah blah
D2: blah blah

This is exactly what I needed, and it worked perfectly on every online regex testing site I tried. But once I put it in PowerShell, the code just outputs everything, with no filtering at all.

Here's the code I'm working with:

$regex = '(.\n?)*(foobar).*(\n?.)*'

$response = Invoke-RestMethod $url
$response | Select-String $regex -AllMatches | ForEach-Object {
    # Write each matched block to the output file
    foreach($foobar in $_.Matches.Value) {
        $foobar | Out-File $fileOutput -Append
    }
}

The URL returns a web page that contains only these blocks of data. With the new regex everything is output as-is, without any filtering, while the old one works as it's supposed to. So I'm assuming something is wrong with the regex.

If anyone could point out what's wrong here, it would be much appreciated!

CodePudding user response:

You can use

$regex = '(?m)^(?:.+\n)*?.*foobar.*(?:\n.+)*'

See the regex demo. Details:

  • (?m) - the inline RegexOptions.Multiline option, so ^ matches at the start of every line
  • ^ - start of any line
  • (?:.+\n)*? - zero or more (but as few as possible) non-empty lines
  • .*foobar.* - a line containing foobar
  • (?:\n.+)* - zero or more (as many as possible) non-empty lines.
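
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern against the sample text from the question. It assumes the text ends up with LF-only line endings. The names $text and $blocks are illustrative, and [regex]::Matches is used instead of Select-String only to keep the example self-contained.

```powershell
# Sample text from the question, joined with LF line endings (an assumption).
$text = @(
    'A1: blah blah'
    'B1: blah blah foobar'
    ''
    'A2: foobar blah blah'
    'B2: blah blah foobar'
    'C2: blah blah'
    'D2: blah blah'
    ''
    'A3: blah blah'
    'B3: blah blah'
    'C3: blah blah'
) -join "`n"

# In .NET, '.' matches every character except \n, so it also matches \r.
# Normalizing CRLF to LF keeps blank lines truly empty (a no-op for this sample).
$text = $text -replace '\r\n', "`n"

$regex = '(?m)^(?:.+\n)*?.*foobar.*(?:\n.+)*'

# Each non-overlapping match is one block that contains 'foobar'.
$blocks = @([regex]::Matches($text, $regex) | ForEach-Object { $_.Value })

$blocks.Count   # 2: the A1/B1 block and the A2..D2 block
$blocks[0]      # A1: blah blah, then B1: blah blah foobar
```

The A3 block is skipped because none of its lines contains foobar, and the empty lines stop both the lazy prefix and the greedy suffix from crossing into a neighbouring block.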

In PowerShell, you can use it like this:

$regex = '(?m)^(?:.+\n)*?.*foobar.*(?:\n.+)*'

$response = Invoke-RestMethod $url
($response | Select-String -Pattern $regex -AllMatches | %{ $_.Matches.Value }) -join "`r`n`r`n" >> $fileOutput
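
One thing worth checking with this approach: Select-String applies the pattern to each input object separately, so a multi-line pattern can only span lines when the whole text arrives as a single string. The sketch below illustrates the difference. The $text value is a hypothetical stand-in for $response, assumed to use LF line endings, and $fromString and $fromLines are illustrative names.

```powershell
$regex = '(?m)^(?:.+\n)*?.*foobar.*(?:\n.+)*'

# Hypothetical stand-in for $response: one multi-line string, LF line endings.
$text = "A1: blah blah`nB1: blah blah foobar`n`nA2: foobar blah blah`nB2: blah blah`n`nA3: blah blah"

# One string in: the pattern sees the whole text and can return whole blocks.
$fromString = @($text | Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches.Value })

# An array of lines in: every line is matched on its own, so blocks are lost.
$fromLines = @(($text -split "`n") | Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches.Value })

$fromString[0]   # two lines: the complete A1/B1 block
$fromLines[0]    # one line only: B1: blah blah foobar
```

If the text comes from a source that yields an array of lines, joining it into one string before piping it to Select-String avoids the second behaviour.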

CodePudding user response:

Splitting the input text first can simplify the solution:

$lf = [Environment]::NewLine
$response -split '\r?\n\r?\n' -match 'foobar' -join "$lf$lf" >>$fileOutput
  • $response -split '\r?\n\r?\n' splits the text on double new-lines (empty lines). The pattern \r?\n matches a single new-line, for both Windows \r\n and Unix \n flavor. The result is an array of text blocks, with the double new-lines removed.
  • -match 'foobar' filters the array of text blocks, keeping only those blocks that contain 'foobar'. Note that the -match operator works differently depending on whether the LHS operand is a single string or an array of strings. With a single string, the result is instead a boolean that indicates whether the pattern matches.
  • Using -join "$lf$lf" we concatenate the matched text blocks into a single string again to produce the desired output.
  • Finally the redirection operator >> appends the string to the output file (you may use | Out-File -Append as well).
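
The steps above can be tried on the sample text from the question with a minimal sketch like the following. The names $text, $blocks, $hits and $result are illustrative, the sample is assumed to use LF line endings, and the result is kept in a variable instead of being appended to a file.

```powershell
# Sample text from the question, joined with LF line endings (an assumption).
$text = @(
    'A1: blah blah'
    'B1: blah blah foobar'
    ''
    'A2: foobar blah blah'
    'B2: blah blah foobar'
    'C2: blah blah'
    'D2: blah blah'
    ''
    'A3: blah blah'
    'B3: blah blah'
    'C3: blah blah'
) -join "`n"

# Split on empty lines: one array element per block.
$blocks = $text -split '\r?\n\r?\n'

# Array on the left-hand side: -match returns the matching elements.
$hits = $blocks -match 'foobar'

# Single string on the left-hand side: -match returns a boolean.
$isMatch = $blocks[2] -match 'foobar'

# Re-join the kept blocks with an empty line between them.
$lf = [Environment]::NewLine
$result = $hits -join "$lf$lf"

$blocks.Count   # 3
$hits.Count     # 2
$isMatch        # False: the A3 block has no 'foobar'
```

Note that -match is case-insensitive by default; -cmatch is the case-sensitive variant if that distinction matters for the real data.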