Matching Strings Using Next Match as Delimiter

Time:05-03

When matching a rough text file with a regex, I want to capture each whole string from one match up to (but not including) the next match, and then write the results to a file. Because the text is irregular, I cannot delimit entries using full stops or newlines. I can write out the matched part of each string, but I want the whole string.

The example text file has jumbled entries:

# this text file replicates a jumbled double listing
001 AALTON, Alan .....25 Every Street006 JOHNS, Jason .... 3 Steep Street 002 BROWN,
James .... 101 Browns Road
005 INSTONE, June .... 14 Dover Road 003 BROWN, Jemmima ........ 101 Browns Road 004 BROWN, John ....... 
101 Browns Road 
005 CAMPBELL, Colin ..... 57 Camp Avenue
004 HANNAH, Harold ....7 Right Way
006 DONNAGAN, Dolores ...11 Main Road
001 EMERSON, John .... 13 South Street
002 FLANAGAN, Lesley .... 1 Lovers Lane
003 GREGG, Ian ..... 101 Short Street

Using

$regexRen = '\d\d\d\s[A-Z][A-Z][A-Z]'
$result = select-string -AllMatches -Path $input_path -Pattern $regexRen -CaseSensitive | % { $_.Matches } | % { $_.Value }

I can get this kind of output, in alphabetical order:

001 AAL
002 BRO
003 BRO
004 BRO
005 CAM
006 DON
001 EME
002 FLA
003 GRE

What I want is the whole string up to the next match. I assume this would also take care of the multiple BROWN entries, that is, return the Browns' personal names in the correct alphabetical order.

Any suggestions would be appreciated.

CodePudding user response:

Using Wiktor's beautiful regex, you can do something like this:

$regexRen = '(?s)(?<!\d)\d{3}\s[A-Z]{3}.*?\D(?=\d{3}\s[A-Z]{3}|$)'
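# Read the file as one string (-Raw) so that a match can span line breaks;
# Select-String -Path would test each line separately.
Get-Content -Raw $input_path | Select-String -AllMatches -Pattern $regexRen -CaseSensitive | 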
    ForEach-Object { $_.Matches.Value -replace '[\n\r]' } | 
        Sort-Object { $_.substring(4, 5) }

Output

001 AALTON, Alan .....25 Every Street
002 BROWN,James .... 101 Browns Road
003 BROWN, Jemmima ........ 101 Browns Road
004 BROWN, John .......101 Browns Road
005 CAMPBELL, Colin ..... 57 Camp Avenue
006 DONNAGAN, Dolores ...11 Main Road
001 EMERSON, John .... 13 South Street
002 FLANAGAN, Lesley .... 1 Lovers Lane
003 GREGG, Ian ..... 101 Short Street
004 HANNAH, Harold ....7 Right Way
005 INSTONE, June .... 14 Dover Road
006 JOHNS, Jason .... 3 Steep Street
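
To see how the lookahead delimits each entry, here is a minimal sketch that applies the same pattern to a short inline string. The `$text` sample and the use of `[regex]::Matches` (which is case-sensitive by default) are assumptions made for illustration, not part of the answer above.

```powershell
# Hypothetical sample standing in for the contents of $input_path.
$text = "001 AALTON, Alan .....25 Every Street006 JOHNS, Jason .... 3 Steep Street 002 BROWN,`nJames .... 101 Browns Road"

# Entry start: three digits not preceded by a digit, one whitespace, three capitals.
# Entry body:  anything, matched lazily ((?s) lets . cross line breaks), ending on a
#              non-digit that is followed by the next entry start or the end of the text.
$pattern = '(?s)(?<!\d)\d{3}\s[A-Z]{3}.*?\D(?=\d{3}\s[A-Z]{3}|$)'

$entries = [regex]::Matches($text, $pattern) |
    ForEach-Object { ($_.Value -replace '[\r\n]').Trim() }

$entries
# 001 AALTON, Alan .....25 Every Street
# 006 JOHNS, Jason .... 3 Steep Street
# 002 BROWN,James .... 101 Browns Road
```

Note that "101 Browns" does not end an entry: the lookahead requires three capitals after the number, and "Bro" fails that test because the match is case-sensitive.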

CodePudding user response:

I suggest using the -split operator, combined with Group-Object and Sort-Object:

$i = @{ Index = 0 }
(Get-Content -Raw $input_path) -csplit '(\d{3}\s[A-Z]{3})' -ne '' | 
  Group-Object { [math]::Floor($i.Index++ / 2) } |
    ForEach-Object { -join ($_.Group -replace '\r?\n', ' ') } |
      Sort-Object { (-split $_)[1] }

Output with your sample input (with each output string enclosed in «...» to illustrate the individual resulting strings):

«001·AALTON,·Alan·.....25·Every·Street»
«002·BROWN,·James·....·101·Browns·Road·»
«003·BROWN,·Jemmima·........·101·Browns·Road·»
«004·BROWN,·John·.......··101·Browns·Road··»
«005·CAMPBELL,·Colin·.....·57·Camp·Avenue·»
«006·DONNAGAN,·Dolores·...11·Main·Road·»
«001·EMERSON,·John·....·13·South·Street·»
«002·FLANAGAN,·Lesley·....·1·Lovers·Lane·»
«003·GREGG,·Ian·.....·101·Short·Street·»
«004·HANNAH,·Harold·....7·Right·Way·»
«005·INSTONE,·June·....·14·Dover·Road·»
«006·JOHNS,·Jason·....·3·Steep·Street·»
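
The pairing step is easier to follow on a small input. The sketch below is an illustration under assumptions: the `$text` sample and the `$tokens` and `$entries` variable names are hypothetical, and `.Trim()` is added only to tidy the result. Because the separator regex is wrapped in a capture group, `-csplit` keeps each separator in its output, so the tokens alternate between an entry prefix and the rest of that entry; dividing a running index by 2 puts each pair into one group.

```powershell
# Hypothetical single-line sample with three jumbled entries.
$text = '001 AALTON, Alan 006 JOHNS, Jason 002 BROWN, James'

# The capture group makes -csplit return the separators too; -ne '' drops the
# empty string that precedes the first separator.
$tokens = $text -csplit '(\d{3}\s[A-Z]{3})' -ne ''
# $tokens: '001 AAL', 'TON, Alan ', '006 JOH', 'NS, Jason ', '002 BRO', 'WN, James'

# A hashtable is used as the counter so the script block can update it in place.
$i = @{ Index = 0 }
$entries = $tokens |
    Group-Object { [math]::Floor($i.Index++ / 2) } |   # tokens 0-1, 2-3, 4-5
    ForEach-Object { (-join $_.Group).Trim() } |       # rejoin prefix + remainder
    Sort-Object { (-split $_)[1] }                     # sort by the surname token

$entries
# 001 AALTON, Alan
# 002 BROWN, James
# 006 JOHNS, Jason
```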