Python - Dissect/Tokenize and Iterate Over Segments of Multiline Text with re

Time:07-21

Assume a VCD file with the following structure, as a minimal example:

#0 <--- section
b10000011#
0$
1%
0&
1'
0(
0)
#2211 <--- section
0'
#2296 <--- section
b0#
1$
#2302 <--- section
0$

I want to split the whole thing into timestamp sections and search each of them for certain values. That is, first isolate the section between the #0 and #2211 timestamps, then the section between #2211 and #2296, and so on.

I am trying to do this with Python in the following way.

import re

# Triple quotes are needed for a string that spans several lines.
search_space = """
#0
b10000011#
0$
1%
0&
1'
0(
0)
#2211 
0'
#2296 
b0#
1$
#2302
0$"""
# the "delimiter"
timestamp_regex = r"\#[0-9]+(.*)\#[0-9]+"

for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
    print(match.groups())

But it has no effect. What is the proper way to handle such a scenario with the re package?

CodePudding user response:

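You need a lazy quantifier here (the ? after the +), so that each chunk stops at the next timestamp instead of running on to the last one. I also captured the timestamp in its own group and moved the next timestamp into a lookahead, so it is not consumed and can start the following match:

timestamp_regex = r"(\#[0-9]+)(.+?)(?=\#[0-9]+|\Z)"
for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
    print(f"section: {match.group(1)}\nchunk:{match.group(2)}\n----")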

output:

section: #0
chunk:
b10000011#
0$
1%
0&
1'
0(
0)

----
section: #2211
chunk: 
0'

----
section: #2296
chunk: 
b0#
1$

----
section: #2302
chunk:
0$

----
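To see why the lazy quantifier matters, a small side-by-side comparison may help. This is only a sketch on a shortened copy of the sample data; the names sample, greedy and lazy are made up for illustration.

```python
import re

# Shortened copy of the sample from the question (illustrative only).
sample = "#0\nb10000011#\n0$\n#2211\n0'\n#2296\nb0#\n1$\n#2302\n0$\n"

# Greedy: .* runs to the end of the text and backtracks only as far as the
# LAST timestamp, so everything in between ends up in a single match.
greedy = re.findall(r"\#[0-9]+(.*)\#[0-9]+", sample, flags=re.DOTALL)

# Lazy: .+? grows one character at a time and stops as soon as the lookahead
# sees the next timestamp (or the end of the text).
lazy = re.findall(r"(\#[0-9]+)(.+?)(?=\#[0-9]+|\Z)", sample, flags=re.DOTALL)

print(len(greedy))             # 1
print([ts for ts, _ in lazy])  # ['#0', '#2211', '#2296', '#2302']
```

Note that the # at the end of b10000011# does not start a new section, because it is not followed by a digit.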

Check the pattern at Regex101
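If an identifier in a value line ever contained # followed by a digit (the sample above does not), the pattern would split in the wrong place. Anchoring the timestamp to the start of a line may be more robust, and re.split then gives the sections directly. The following is a minimal sketch rather than part of the original answer: split_sections and sections_containing are hypothetical helper names, and it assumes every timestamp sits alone on its own line.

```python
import re

# Line-anchored timestamp: '#' at the start of a line, digits, optional
# trailing blanks, then the end of that line.
TIMESTAMP = re.compile(r"^#(\d+)[ \t]*$", flags=re.MULTILINE)

def split_sections(text):
    """Map each timestamp (as int) to the non-empty lines that follow it."""
    # With one capture group, re.split returns
    # [text_before_first, ts1, body1, ts2, body2, ...].
    parts = TIMESTAMP.split(text)
    sections = {}
    for ts, body in zip(parts[1::2], parts[2::2]):
        sections[int(ts)] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return sections

def sections_containing(sections, value):
    """Return the timestamps whose section contains the exact line `value`."""
    return [ts for ts, lines in sections.items() if value in lines]

sample = "#0\nb10000011#\n0$\n1%\n#2211 \n0'\n#2296 \nb0#\n1$\n#2302\n0$\n"
sections = split_sections(sample)
print(sections[2296])                       # ['b0#', '1$']
print(sections_containing(sections, "0$"))  # [0, 2302]
```

Keeping the sections in a dict keyed by timestamp makes the "search in every section" step a plain membership test instead of another regex.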
