I am using a regex to process a messy text file, and I want each result to contain the whole string from the start of one match up to the next match, before writing the results to a file. Because the text is so irregular, I cannot delimit entries using full stops or newlines. I can currently write out only the matched part of each string, but I want the whole string.
The example text file has jumbled entries:
# this text file replicates a jumbled double listing
001 AALTON, Alan .....25 Every Street006 JOHNS, Jason .... 3 Steep Street 002 BROWN,
James .... 101 Browns Road
005 INSTONE, June .... 14 Dover Road 003 BROWN, Jemmima ........ 101 Browns Road 004 BROWN, John .......
101 Browns Road
005 CAMPBELL, Colin ..... 57 Camp Avenue
004 HANNAH, Harold ....7 Right Way
006 DONNAGAN, Dolores ...11 Main Road
001 EMERSON, John .... 13 South Street
002 FLANAGAN, Lesley .... 1 Lovers Lane
003 GREGG, Ian ..... 101 Short Street
Using
$regexRen = '\d\d\d\s[A-Z][A-Z][A-Z]'
$result = select-string -AllMatches -Path $input_path -Pattern $regexRen -CaseSensitive | % { $_.Matches } | % { $_.Value }
I can get output of the following kind, in alphabetical order:
001 AAL
002 BRO
003 BRO
004 BRO
005 CAM
006 DON
001 EME
002 FLA
003 GRE
What I want is the whole string from each match up to the next match. I assume this would also take care of the multiple BROWN entries, that is, return the Browns in the correct alphabetical order of their personal names.
Any suggestions would be appreciated.
CodePudding user response:
Using Wiktor's beautiful Regex you can do something like this
$regexRen = '(?s)(?<!\d)\d{3}\s[A-Z]{3}.*?\D(?=\d{3}\s[A-Z]{3}|$)'
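# Select-String -Path matches line by line, so read the file as a single string to let matches span line breaks.
Get-Content -Raw $input_path |
Select-String -AllMatches -Pattern $regexRen -CaseSensitive |
ForEach-Object { $_.Matches.Value -replace '[\n\r]' } |
Sort-Object { $_.Substring(4, 5) }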
Output
001 AALTON, Alan .....25 Every Street
002 BROWN,James .... 101 Browns Road
003 BROWN, Jemmima ........ 101 Browns Road
004 BROWN, John .......101 Browns Road
005 CAMPBELL, Colin ..... 57 Camp Avenue
006 DONNAGAN, Dolores ...11 Main Road
001 EMERSON, John .... 13 South Street
002 FLANAGAN, Lesley .... 1 Lovers Lane
003 GREGG, Ian ..... 101 Short Street
004 HANNAH, Harold ....7 Right Way
005 INSTONE, June .... 14 Dover Road
006 JOHNS, Jason .... 3 Steep Street
CodePudding user response:
I suggest using the -split operator, combined with Group-Object and Sort-Object:
$i = @{ Index = 0 }
(Get-Content -Raw $input_path) -csplit '(\d{3}\s[A-Z]{3})' -ne '' |
Group-Object { [math]::Floor($i.Index++ / 2) } |
ForEach-Object { -join ($_.Group -replace '\r?\n', ' ') } |
Sort-Object { (-split $_)[1] }
Output with your sample input (with each output string enclosed in «...» to illustrate the individual resulting strings):
«001·AALTON,·Alan·.....25·Every·Street»
«002·BROWN,·James·....·101·Browns·Road·»
«003·BROWN,·Jemmima·........·101·Browns·Road·»
«004·BROWN,·John·.......··101·Browns·Road··»
«005·CAMPBELL,·Colin·.....·57·Camp·Avenue·»
«006·DONNAGAN,·Dolores·...11·Main·Road·»
«001·EMERSON,·John·....·13·South·Street·»
«002·FLANAGAN,·Lesley·....·1·Lovers·Lane·»
«003·GREGG,·Ian·.....·101·Short·Street·»
«004·HANNAH,·Harold·....7·Right·Way·»
«005·INSTONE,·June·....·14·Dover·Road·»
«006·JOHNS,·Jason·....·3·Steep·Street·»
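If the order of the personal names within one surname matters (the three BROWN entries), a possible variation is to split with a zero-width lookahead, so each entry stays in one piece and no regrouping is needed, and then sort on both the surname and the personal name. This is only a minimal sketch: the inline `$text` sample stands in for `Get-Content -Raw $input_path`, and it assumes every entry has the shape "number SURNAME, Name ...", so that the second and third whitespace-separated tokens are the surname and the personal name.

```powershell
# Sample taken from the question; in practice use: $text = Get-Content -Raw $input_path
$text = @'
001 AALTON, Alan .....25 Every Street006 JOHNS, Jason .... 3 Steep Street 002 BROWN,
James .... 101 Browns Road
005 INSTONE, June .... 14 Dover Road 003 BROWN, Jemmima ........ 101 Browns Road 004 BROWN, John .......
101 Browns Road
'@

# Split (case-sensitively) just before every "3 digits, whitespace, 3 capitals" marker.
# The lookahead consumes nothing, so each piece keeps its own marker.
$entries = $text -csplit '(?=(?<!\d)\d{3}\s[A-Z]{3})' -ne '' |
    ForEach-Object { ($_ -replace '\r?\n', ' ').Trim() } |       # join wrapped lines, trim edges
    Sort-Object { (-split $_)[1] }, { (-split $_)[2] }           # surname first, then personal name

$entries
```

With this sample the three BROWN entries should come out as James, Jemmima, John. To write the result to a file, the pipeline can end in `Set-Content -Path $output_path`, where `$output_path` is a hypothetical variable not defined in the question.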