How can I split a string using a regular expression, to the left of the matching string?

I have the following sample text:

Performed by:

ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John    Age:80
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally    Age:31
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]

ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj    Age:56
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]

00123795,,"TEXT:
Name: Shiloh    Age:12
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]

And I'm trying to split it to the left of:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
where the ID NUMBER: line, the XSOR-160491" line, and the second number are optional (not present in every instance).

I created the following regular expression, but it consumes the information I want to keep and doesn't account for all of the optional text.

re.split(r"[0-9]+,[0-9]+?,\"TEXT", test)

I tried adding lookahead with ?=:

re.split(r"?=([0-9]+,[0-9]+?,\"TEXT)", test)

But that didn't seem to work. Any help is greatly appreciated!

Edit: Expected output is as follows:

ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John    Age:80
...

XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally    Age:31 
...

ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj    Age:56
...

00123795,,"TEXT:
Name: Shiloh    Age:12
...

CodePudding user response：

You could match the (optional) lines up to and including the line with "TEXT:, and wrap that whole match in a capture group. re.split will then return what the capture group matched as separate entries in the list of chunks. You can then pair each of those entries with the chunk that follows it to get the final split:

import re

regex = re.compile(r'(?m)^((?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n)')

s = """
Performed by:

ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John    Age:80

XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally    Age:31

ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj    Age:56

00123795,,"TEXT:
Name: Shiloh    Age:12
"""

it = iter(regex.split(s))
# Pair the "delimiter" chunks with the successor chunks:
result = [next(it)] + [match + next(it) for match in it]

print("----\n".join(result))

The output of this code is:

Performed by:

----
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John    Age:80

----
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally    Age:31

----
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj    Age:56

----
00123795,,"TEXT:
Name: Shiloh    Age:12

The regular expression is quite strict, so if you have some more variation in the lines that start a block, you'll have to relax the regex accordingly.
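
Since the question asks for a split to the left of each match, another option is to skip re.split and cut the string at the start position of each match. The sketch below assumes the same block-start pattern as above (without the outer capture group); the helper name split_left_of_matches and the shortened sample string are made up for illustration. A bare lookahead split would likely be harder to get right here, because the optional lines mean the lookahead also succeeds at the start of the XSOR and number lines inside a single header.

```python
import re

# Same block-start pattern as in the answer, minus the outer capture group.
BLOCK_START = re.compile(r'(?m)^(?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n')

def split_left_of_matches(text):
    """Cut text immediately before each block start (hypothetical helper name)."""
    starts = [m.start() for m in BLOCK_START.finditer(text)]
    # Chunk boundaries: start of text, every match start, end of text.
    bounds = [0] + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

# Shortened, made-up sample in the same shape as the question's data.
sample = 'Performed by:\n\nID NUMBER:\nXSOR-1"\n12,34,"TEXT:\nName: A\n\n56,,"TEXT:\nName: B\n'
for chunk in split_left_of_matches(sample):
    print(repr(chunk))
```

Because the chunks are plain slices of the input, joining them back together reproduces the original string, which makes the result easy to sanity-check.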

Explanation of regex

(?m) is the multiline flag (for the whole regex); it makes ^ (and $) match at the start (and end) of every line instead of only at the start (and end) of the whole text.
^ requires the match to start at the beginning of a line
(?: )? makes a part optional without creating a so-called capture group for it.
(?:ID NUMBER:\n)? allows for this optional, literal line
(?:XSOR-\d+"\n)? allows for this optional line that has some digits (\d+)
\d+,\d*,"TEXT:\n requires a line with two numbers, of which the second is optional.
( ): this wraps around the whole match, and is the only capture group in the regex. re.split will reproduce what is captured inside those parentheses as a separate chunk in the returned list.
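
To see the effect of the capture group in isolation, here is a minimal sketch on a toy string (the string and variable names are made up for illustration). It shows re.split with and without a capture group, followed by the same pairing step used above.

```python
import re

text = "a1b2c"

# Without a capture group the delimiters are dropped.
print(re.split(r"\d", text))    # ['a', 'b', 'c']

# With a capture group the delimiters are kept as their own list items.
parts = re.split(r"(\d)", text)
print(parts)                    # ['a', '1', 'b', '2', 'c']

# Glue each delimiter onto the chunk that follows it.
it = iter(parts)
glued = [next(it)] + [delim + next(it) for delim in it]
print(glued)                    # ['a', '1b', '2c']
```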