I am trying to store a text file string which has a beginning and end that make it a substring of the original text file. I am new to Powershell so my methods are simple/crude. Basically my approach has been:
- Roughly get what I want from the start of the string
- Worry about trimming off what I don't want later
My minimum reproducible example is as follows:
# selectStringTest.ps
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"
# selected text string needs to span from $refName up to $boundaryName
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"
# a rough estimate of the text file lines required
[int]$lines = 200
if (Select-String -InputObject $inputFile -pattern $refName) {
Write-Host "Selected shortened string found!"
# this selects the start of required string but with extra text
[string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines
}
else {
Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')
# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)
$newFileStart | Out-File tempOutputFile
As it is: the output begins correctly but I cannot remove text including and after $boundaryName
The original text file is OCR generated (Optical Character Recognition) So it is unevenly formatted. There are newlines in odd places. So I have limited options when it comes to delimiting.
I am not sure my if (Select-String -InputObject $inputFile -pattern $refName)
is valid. It appears to work correctly. The general design seems crude. In that I am guessing how many lines I will need. And finally I have tried various methods of trimming the string from $boundaryName
without success. For this:
- string.split() not practical
- replacing spaces with newlines in an array & looping through to elements of $boundaryName is possible but I don't know how to terminate the array at this point before returning it to string.
Any suggestions would be appreciated.
Abbreviated content of x2 200 listings single Copy of 31832_226140__0001-00006.txt
file is:
Beginning of text file
________________
BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth
Middle of text file
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........
End of text file
..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON
CodePudding user response:
To use a regex across newlines, the file needs to be read as a single string. Get-Content -Raw
will do that. This assumes that you do not want the lines containing refName and boundaryName included in the output
$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"
if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
$result = $Matches[1]
}
$result
More information at https://stackoverflow.com/a/12573413/447901
CodePudding user response:
How close does this come to what you want?
function Process-File {
param (
[Parameter(Mandatory = $true, Position = 0)]
[string]$HeadText,
[Parameter(Mandatory = $true, Position = 1)]
[string]$TailText,
[Parameter(ValueFromPipeline)]
$File
)
Process {
$Inside = $false;
switch -Regex -File $File.FullName {
#'^\s*$' { continue }
"(?i)^\s*$TailText(?<Tail>.*)`$" { $Matches.Tail; $Inside = $false }
'^(?<Line>. )$' { if($Inside) { $Matches.Line } }
"(?i)^\s*$HeadText(?<Head>.*)`$" { $Matches.Head; $Inside = $true }
default { continue }
}
}
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'
$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"
This is the output:
. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........