PowerShell: split text file into pages by delimiter

New to PowerShell here. I have a large text file with many similar pages that currently run together. I want to use the delimiter "TESTING/TEST SYSTEM", which appears at the top of every page, to separate them into individual pages. The raw original source always has a 1 on the first line and a 0 on the second line, probably left over from some old mainframe system. I do not want to use the 1 and 0 as the delimiter, because I have other files with different delimiters, and without the 1 and 0, that I want to run this command against.

Here's what I found so far on Stack Overflow; it is partially working:

(Get-Content -Raw inFile.txt) -split '(TESTING/TEST SYSTEM)' |
  Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }

However, this keeps creating two extra files. The first file contains only the 1 and 0. The second file contains just the delimiter, stripped from the rest of the page's content. The third file has the rest of the content. This repeats until all the pages are separated, creating 3 files for each section. I just need the delimiter to be part of each page. The 1 and 0 can be part of it as well, or removed, whichever is easier. Thanks so much for your help!

CodePudding user response:

(Get-Content -Raw inFile.txt) -split '(?=TESTING/TEST SYSTEM)' |
  Set-Content -LiteralPath { 'c:\test\outFile{0}.txt' -f $script:index++ }

Note:

- -split invariably returns a token for whatever precedes the first separator match; if the input starts with a separator, the first array element returned is '' (the empty string).
  - If no other tokens are empty, or if it is acceptable / desired to eliminate all empty tokens, you can simply append -ne '' to the -split operation.
- If you want to make splitting case-sensitive, use -csplit instead of -split.
- If you want to ensure that the regex only matches at the start of a line, use '(?m)(?=^TESTING/TEST SYSTEM)'.
- (?=...) in the separator regex is a (positive) look-ahead assertion that causes the separator to be included as part of each token, as explained below.
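
The points in this note can be tried out on small in-memory strings before touching a real file. This is only a minimal sketch: the sample strings and variable names are made up for illustration.

```powershell
# Made-up sample: two "pages"; the input starts with the separator.
$text = "TESTING/TEST SYSTEM`npage 1`nTESTING/TEST SYSTEM`npage 2`n"

# Look-ahead split: the first token is '' because the input starts with a match.
$tokens = $text -split '(?=TESTING/TEST SYSTEM)'
$tokens.Count       # -> 3
$tokens[0] -eq ''   # -> True

# Appending -ne '' filters out all empty tokens.
$pages = $text -split '(?=TESTING/TEST SYSTEM)' -ne ''
$pages.Count        # -> 2

# -split is case-insensitive by default; -csplit is case-sensitive.
$insensitive = 'a TEST b test c' -split 'TEST'
$sensitive   = 'a TEST b test c' -csplit 'TEST'
$insensitive.Count  # -> 3
$sensitive.Count    # -> 2

# (?m) makes ^ match at the start of every line, so a separator that
# appears in the middle of a line no longer triggers a split.
$mixed = "intro mentions TESTING/TEST SYSTEM inline`nTESTING/TEST SYSTEM`npage"
$unanchored = $mixed -split '(?=TESTING/TEST SYSTEM)'
$anchored   = $mixed -split '(?m)(?=^TESTING/TEST SYSTEM)'
$unanchored.Count   # -> 3
$anchored.Count     # -> 2
```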

The binary form of the -split operator:

By default, it excludes whatever the (first) RHS operand (the separator regex) matches from the array of tokens it returns:
```
'a@b@c' -split '@' # -> 'a', 'b', 'c'
```
If you use a capture group ((...)) in the separator regex, what the capture group matches is included in the return array, as separate tokens:
```
'a@b@c' -split '(@)' # -> 'a', '@', 'b', '@', 'c'
```
If you want to include what the separator regex matches as part of each token, you must use a look-around assertion:
- With a look-ahead assertion ((?=...)) at the start of each token:
```
'a@b@c' -split '(?=@)' # -> 'a', '@b', '@c'
```
- With a look-behind assertion ((?<=...)) at the end of each token:
```
'a@b@c' -split '(?<=@)' # -> 'a@', 'b@', 'c'
```
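
To tie the pieces together, here is a hedged end-to-end sketch of the look-ahead approach. It assumes PowerShell 5 or later (for -NoNewline), writes to a scratch directory under the temp folder instead of c:\test, and uses made-up sample content. It also uses an explicit loop counter in place of the delay-bind script block from the command above, purely to make the numbering easy to follow.

```powershell
# Hypothetical scratch directory and file names, for illustration only.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ('split-demo-' + [guid]::NewGuid())
$null = New-Item -ItemType Directory -Path $dir
$inFile = Join-Path $dir 'inFile.txt'

# Made-up sample input: three pages, each starting with the delimiter line.
Set-Content -LiteralPath $inFile -Value @(
    'TESTING/TEST SYSTEM', 'page 1 body'
    'TESTING/TEST SYSTEM', 'page 2 body'
    'TESTING/TEST SYSTEM', 'page 3 body'
)

# Split before each delimiter line (the look-ahead keeps the delimiter
# in the page) and drop the empty token before the first page.
$pages = @((Get-Content -Raw -LiteralPath $inFile) -split '(?m)(?=^TESTING/TEST SYSTEM)' -ne '')

# Write one numbered file per page.
$index = 0
foreach ($page in $pages) {
    $index++
    $outFile = Join-Path $dir ('outFile{0}.txt' -f $index)
    # -NoNewline: each page already ends with its own newline.
    Set-Content -LiteralPath $outFile -Value $page -NoNewline
}

$index   # -> 3
```

Each outFileN.txt then begins with the TESTING/TEST SYSTEM line, followed by that page's body.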