PowerShell 7.x: How to Select a Text Substring of Unknown Length Using Only Boundary Substrings

Time:02-23

I am trying to store a string from a text file; the string is a substring of the file's text, delimited by a known beginning and a known end. I am new to PowerShell, so my methods are simple/crude. Basically, my approach has been:

  1. Roughly get what I want from the start of the string
  2. Worry about trimming off what I don't want later

My minimum reproducible example is as follows:

# selectStringTest.ps    
         
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"

#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"

# a rough estimate of the text file lines required
[int]$lines = 200
   
if (Select-String  -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines   
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')

# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)

$newFileStart | Out-File tempOutputFile

As it is, the output begins correctly, but I cannot remove the text from $boundaryName onwards.

The original text file is OCR-generated (Optical Character Recognition), so it is unevenly formatted. There are newlines in odd places, so I have limited options when it comes to delimiting.

I am not sure whether my if (Select-String -InputObject $inputFile -pattern $refName) test is valid, although it appears to work correctly. The general design seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods of trimming the string from $boundaryName without success. For this:

  • string.Split() is not practical
  • replacing spaces with newlines in an array and looping through to the elements of $boundaryName is possible, but I don't know how to terminate the array at that point before converting it back to a string.

Any suggestions would be appreciated.

Abbreviated content of the single Copy of 31832_226140__0001-00006.txt file (2 x 200 listings) is:

Beginning of text file

________________

BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON

CodePudding user response:

How close does this come to what you want?

function Process-File {
    Process {
        $Inside = $false;
        switch -Regex -File $Input.FullName {
            #'^\s*$' { continue }
            '(?i)^\s*001 BEECH(?<Tail>.*)$'   { $Matches.Tail; $Inside = $false }
            '^(?<Line>.+)$'                { if($Inside) { $Matches.Line } }
            '(?i)^\s*001 BARTLETT(?<Head>.*)$' { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'

$Result = Get-ChildItem -Path "$Path\$File" | Process-File
$Result | Out-File -FilePath "$Path\SpanText.txt"

This is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........
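
If the whole start line should be kept and the boundary line dropped entirely, the same line-by-line idea can be sketched roughly as follows. This is only an illustration under assumptions: the $lines array is made-up sample data standing in for the file, both names are assumed to sit at the start of a line, and the variable names are hypothetical.

```powershell
# Illustrative sample lines standing in for the OCR file (not the real data).
$lines = @(
    'www.'
    '001 BARTLETT. Lois Elizabeth'
    '............ 15 St Ronans Av. Lower Hutt'
    '001 BEECH, Margaret ..........'
    '.............. Eastbourne'
)

$inside = $false
$span = switch -Regex ($lines) {
    # Boundary line: stop reading; break leaves the whole switch.
    '^\s*001 BEECH'    { break }
    # Start line: turn the flag on and keep the full line.
    '^\s*001 BARTLETT' { $inside = $true; $_; continue }
    # Any other line is kept only while inside the span.
    default            { if ($inside) { $_ } }
}
$span
```

Because break ends the switch at the boundary line, nothing after 001 BEECH is read, so no guess at a line count is needed.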

CodePudding user response:

To use a regex across newlines, the file needs to be read as a single string; Get-Content -Raw will do that. This assumes that you do not want the lines containing $refName and $boundaryName included in the output.

$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"

if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
    $result = $Matches[1]
}
$result
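
The pattern above expects CRLF line endings and inserts the two names into the regex unescaped. A possible variant that relaxes both points is sketched below; it is an untested-against-the-real-file illustration, the $text value is made-up sample data standing in for the Get-Content -Raw result, and $pattern is a hypothetical name.

```powershell
# Illustrative text standing in for Get-Content -Raw output (LF-only on purpose).
$text = "www.`n001 BARTLETT. Lois Elizabeth`n.... 15 St Ronans Av. Lower Hutt`n001 BEECH, Margaret ....`n.... Eastbourne`n"

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# [regex]::Escape makes any regex metacharacters in the names match literally.
# \r?\n accepts both CRLF and LF; (.*?) is lazy, so it stops at the first boundary.
$pattern = '(?s)' + [regex]::Escape($refName) + '.*?\r?\n(.*?)' + [regex]::Escape($boundaryName)

# -match is case-insensitive by default, like the (?i) flag in the original pattern.
if ($text -match $pattern) {
    $result = $Matches[1]
}
$result
```

As in the answer, the lines containing the two names are left out of the result.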

More information at https://stackoverflow.com/a/12573413/447901
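
If a regex is not needed at all, a plain index-based alternative may be enough. The sketch below is an assumption-laden illustration rather than part of either answer: $text is made-up sample data standing in for the file read as one string, and it keeps the text from $refName up to, but not including, $boundaryName.

```powershell
# Illustrative text standing in for the file read as a single string.
$text = "www.`n001 BARTLETT. Lois Elizabeth`n.... 15 St Ronans Av. Lower Hutt`n001 BEECH, Margaret ....`n"

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# Ordinal comparison: exact, case-sensitive, culture-independent.
$ordinal = [System.StringComparison]::Ordinal

$startIndex = $text.IndexOf($refName, $ordinal)

# Look for the boundary only after the start marker; skip the search if there is no start.
$endIndex = if ($startIndex -ge 0) {
    $text.IndexOf($boundaryName, $startIndex + $refName.Length, $ordinal)
} else { -1 }

if ($endIndex -ge 0) {
    # Keep everything from the start marker up to, but not including, the boundary.
    $span = $text.Substring($startIndex, $endIndex - $startIndex)
    $span
}
else {
    Write-Host 'Start or boundary marker not found.'
}
```

This avoids guessing a line count, since the end of the span is found from the boundary text itself.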
