I have the following sample text:
Performed by:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
00123795,,"TEXT:
Name: Shiloh Age:12
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
And I'm trying to split it to the left of:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
where the ID NUMBER: line, the XSOR-...' line, and the second number are all optional (not present in every instance).
I wrote the following regex, but it consumes the information I want to keep and doesn't account for all of the optional text.
re.split(r"[0-9]+,[0-9]+?,\"TEXT", test)
I tried adding a lookahead with ?=:
re.split(r"?=([0-9]+,[0-9]+?,\"TEXT)", test)
But that didn't seem to work. Any help is greatly appreciated!
Edit: Expected output is as follows:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
...
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
...
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
...
00123795,,"TEXT:
Name: Shiloh Age:12
...
CodePudding user response:
You could match the (optional) lines up to and including the line with "TEXT: and wrap that whole pattern in a capture group. re.split will then return what that capture group matched as separate entries in the list of chunks. You can then pair up these chunks to get the final split:
import re
regex = re.compile(r'(?m)^((?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n)')
s = """
Performed by:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
00123795,,"TEXT:
Name: Shiloh Age:12
"""
it = iter(regex.split(s))
# Pair the "delimiter" chunks with the successor chunks:
result = [next(it)] + [match + next(it) for match in it]
print("----\n".join(result))
The output of this code is:
Performed by:
----
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
----
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
----
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
----
00123795,,"TEXT:
Name: Shiloh Age:12
The regular expression is quite strict, so if you have some more variation in the lines that start a block, you'll have to relax the regex accordingly.
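If pairing up chunks feels indirect, the same split can be sketched with re.finditer: collect the offsets where each block header starts and slice the string between them. This is only a minimal sketch under the same assumptions about the header lines; the helper name split_blocks and the shortened sample are hypothetical, not part of the original code.

```python
import re

# Same header pattern as above, without the outer capture group
# (it is not needed when slicing by offsets).
HEADER = re.compile(r'(?m)^(?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n')

def split_blocks(text):
    """Split text to the left of every header match (hypothetical helper)."""
    starts = [m.start() for m in HEADER.finditer(text)]
    if not starts:
        return [text]
    bounds = [0] + starts + [len(text)]
    # Slice between consecutive boundaries; skip an empty leading chunk.
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]

# Shortened sample, assumed to follow the layout in the question.
sample = (
    'Performed by:\n'
    'ID NUMBER:\n'
    'XSOR-160491"\n'
    '15632894,136259874,"TEXT:\n'
    'Name: John Age:80\n'
    '00123795,,"TEXT:\n'
    'Name: Shiloh Age:12\n'
)

blocks = split_blocks(sample)
print("----\n".join(blocks))
```

Because nothing is consumed or re-glued, joining the returned blocks gives back the original text unchanged.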
Explanation of regex
(?m) is the multiline flag (for the whole regex). It makes ^ (and $) match at the start (and end) of every line instead of only at the start (and end) of the whole text.
^ requires the match to start at the beginning of a line.
(?: )? makes a part optional without creating a so-called capture group for it.
(?:ID NUMBER:\n)? allows for this optional, literal line.
(?:XSOR-\d+"\n)? allows for this optional line that has one or more digits (\d+).
\d+,\d*,"TEXT:\n requires a line with two numbers, of which the second is optional.
( ) wraps around the whole match and is the only capture group in the regex. re.split will reproduce what is captured inside those parentheses as a separate chunk in the returned list.
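To see the capture-group behaviour of re.split in isolation, here is a small sketch on a toy string (the string and variable names are made up for illustration). It also repeats the pairing step from the code above on a smaller scale.

```python
import re

text = "a1b2c"

# Without a capture group the delimiters (the digits) are dropped.
plain = re.split(r"\d", text)

# With a capture group the delimiters are kept as separate list items.
kept = re.split(r"(\d)", text)

# Glue each delimiter onto the chunk that follows it, as in the code above.
it = iter(kept)
glued = [next(it)] + [delim + next(it) for delim in it]

print(plain)  # ['a', 'b', 'c']
print(kept)   # ['a', '1', 'b', '2', 'c']
print(glued)  # ['a', '1b', '2c']
```

With a single capture group the returned list always has an odd length (chunk, delimiter, chunk, ...), which is why the pairing loop never runs out of items halfway through a pair.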