Output Substring to Newline from a Raw Text String using Regex

Time:03-10

I have a name delimiter that I want to use to extract the whole line where it is found.

[string]$testString = $null

# broken text string of text & newlines which simulates $testString = Get-Content -Raw

$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"

# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"

# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined 
$testString -match "(?s)(?<BOURKE>.*)"

#test3
# I want just the line with BOURKE

$result = $matches['BOURKE']

$result

Test 1 finds a match but returns only the text up to the first newline. Test 2 finds a match and returns everything, newlines included. What regex pattern would make the output begin at 001 BOURKE ... and contain only that line?

Any suggestions would be appreciated.

CodePudding user response:

I find it best to have the match consume everything up to what is not needed, which here is the line break (\r\n). A negated character class does that: the ^ inside the brackets of [^\r\n] means "any character that is neither \r nor \n", so the match stops at the end of the line.

To do that use

$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"

Also, try to avoid * when you know there will be text to match. * means zero or more, so the engine also has to consider the empty match. + (one or more) requires at least one character, which rules out those implausible candidates and can reduce backtracking.
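As a rough sketch of how the pattern above behaves, the snippet below runs it against an abbreviated copy of the question's sample text (the shortened lines and the $result variable are assumptions made for illustration). Note that the pattern keys on the three-digit prefix rather than on the name BOURKE, so it returns the first stretch of text that starts with three digits and a whitespace character.

```powershell
# Abbreviated stand-in for the question's $testString (illustrative data only).
$testString = @'
initial text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... 13 Mahina Road
line after the line I want
'@

# \d\d\d\s anchors on the three-digit prefix plus one whitespace character;
# [^\r\n]+ then consumes the rest of the line and stops before any CR or LF.
$result = $null
if ($testString -match '(?<Bourke>\d\d\d\s[^\r\n]+)') {
    # The named group is available by name in the automatic $Matches table.
    $result = $Matches['Bourke']
}

$result
```

If the digits cannot be relied on, the name itself can be placed in the pattern instead, as the next answer does.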

CodePudding user response:

While a pure regex solution is possible (see bottom section), in this case I suggest delegating to the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches:

(Select-String -LiteralPath file.txt -Pattern BOURKE).Line

Add -CaseSensitive for case-sensitive matching.

The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines):

(
  @'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
    Select-String -Pattern BOURKE
).Line

Output:

001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
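As a variation on the example above, here is a minimal sketch that combines -SimpleMatch and -CaseSensitive on an in-memory array. The sample lines and the variable names $lines and $hits are assumptions chosen for illustration, not part of the original question.

```powershell
# Illustrative input: one upper-case and one lower-case occurrence of the name.
$lines = 'initial text',
         '001 BOURKE, Bridget Mary',
         '002 bourke, lower-case entry',
         'extra text'

# -SimpleMatch treats the pattern as a literal substring rather than a regex;
# -CaseSensitive makes the comparison case-sensitive, so only the upper-case line matches.
$hits = $lines | Select-String -Pattern 'BOURKE' -SimpleMatch -CaseSensitive

# Each hit is a MatchInfo object; its Line property holds the full matching line.
$hits | ForEach-Object { $_.Line }
```

Without -CaseSensitive, both BOURKE lines would be returned, because Select-String is case-insensitive by default.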

If you do need a pure regex solution that operates directly on a multi-line input string:

if (
  @'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -match '.*BOURKE.*') {
  $Matches[0]
}

To match case-sensitively, use -cmatch instead of -match.

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.

Note: If your input string uses Windows CRLF newlines (\r\n) instead of Unix LF newlines (\n), use the following regex instead, to avoid capturing the CR (\r) at the end of the line:

'.*BOURKE[^\r\n]*'
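To see the difference between the two patterns, the following sketch builds a CRLF-delimited string explicitly (so the result does not depend on how the script file was saved) and compares both matches. The sample text and the variable names $text, $withCr and $clean are assumptions for illustration.

```powershell
# Illustrative three-line input with explicit Windows-style CRLF newlines.
$text = "before`r`n001 BOURKE, Bridget Mary`r`nafter"

# In .NET regexes, . matches every character except \n, so the trailing .* also picks up the CR.
$null = $text -match '.*BOURKE.*'
$withCr = $Matches[0]

# [^\r\n]* stops before either newline character, so the CR is left out.
$null = $text -match '.*BOURKE[^\r\n]*'
$clean = $Matches[0]

# Compare the two results: the first ends with a CR, the second does not.
$withCr.EndsWith("`r")
$clean.EndsWith("`r")
```

The [^\r\n]* form is also safe on LF-only input, so it can be used when the newline style is not known in advance.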
