How to capture all repetitions of a subpattern in regex

I have a formatted string that can contain a repeated part of arbitrary length. For example, here is a sample of the metadata I want to parse.

File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds

File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0

File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds

File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds

So far I've created a regex that captures the first lines, but it only captures the last seizure in a block, if the block has any seizures.

import re

summary = "a formatted string read"


pattern = r"File Name\: (.+)\nFile Start Time\: (.+)\nFile End Time\: (.+)\nNumber of Seizures in File\: (.+)(?:\n|\r|)(?:Seizure(?: | \d+ )Start Time\: (\d+) seconds\nSeizure(?: | \d+ )End Time\: (\d+) seconds(?:\n|\r|))*"
pattern = re.compile(pattern)

for p in pattern.finditer(summary):
    print(p.groups())

But for the last block, for example, this pattern only captures the start and end time of seizure 4. Is it possible to capture all repetitions of a repeated subpattern?
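
For context, here is a minimal sketch of the behaviour described above, using only the standard `re` module. The string `text` and the variable names are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical toy input: the same key repeated three times.
text = "a=1;a=2;a=3;"

# A capture group inside a repeated group is overwritten on every
# iteration, so only the final repetition survives in the match object.
m = re.fullmatch(r"(?:a=(\d+);)*", text)
last_only = m.group(1)

# Scanning the same text with findall returns every occurrence instead.
all_values = re.findall(r"a=(\d+);", text)

print(last_only)   # 3
print(all_values)  # ['1', '2', '3']
```

This is why the pattern above reports only seizure 4 for the last block: groups 5 and 6 are reused on each repetition of the seizure subpattern.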

EDIT: Using regex and the pattern that The fourth bird posted in the comments, I can match the strings, but I get a lot of None values in the repeated rows, as well as rows that are entirely None. How can I get rid of those, or fill in the appropriate values?

('chb23_06.edf', '08:57:57', '11:02:43', '1', '3962', '4075')
(None, None, None, None, None, None)
('chb23_07.edf', '11:03:16', '11:45:56', '0', None, None)
(None, None, None, None, None, None)
('chb23_08.edf', '11:48:05', '14:40:27', '2', '325', '345')
(None, None, None, None, '5104', '5151')
(None, None, None, None, None, None)
('chb23_09.edf', '14:40:47', '18:41:13', '4', '2589', '2660')
(None, None, None, None, '6885', '6947')
(None, None, None, None, '8505', '8532')
(None, None, None, None, '9580', '9664')
(None, None, None, None, None, None)
('chb23_10.edf', '18:41:40', '22:41:40', '0', None, None)
(None, None, None, None, None, None)
('chb23_16.edf', '13:46:32', '17:46:32', '0', None, None)
(None, None, None, None, None, None)
('chb23_17.edf', '17:46:42', '21:16:29', '0', None, None)
(None, None, None, None, None, None)
('chb23_19.edf', '02:28:28', '6:28:28', '0', None, None)
(None, None, None, None, None, None)
('chb23_20.edf', '06:28:36', '7:52:05', '0', None, None)
(None, None, None, None, None, None)
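
One possible way to clean up rows like these after matching, as a sketch only: drop the rows that are entirely None and copy the most recent file fields into the rows that hold only seizure times. The function name `fill_rows` is hypothetical, and the `rows` list is a short excerpt of the tuples printed above.

```python
# Excerpt of the tuples shown above (file fields first, seizure times last).
rows = [
    ("chb23_07.edf", "11:03:16", "11:45:56", "0", None, None),
    (None, None, None, None, None, None),
    ("chb23_08.edf", "11:48:05", "14:40:27", "2", "325", "345"),
    (None, None, None, None, "5104", "5151"),
    (None, None, None, None, None, None),
]


def fill_rows(rows):
    """Drop all-None rows and forward-fill the four file fields."""
    filled = []
    header = None  # file fields of the most recent row that had them
    for row in rows:
        if all(value is None for value in row):
            continue  # separator row, nothing to keep
        if row[0] is not None:
            header = row[:4]
        if header is None:
            continue  # seizure row with no preceding file row
        filled.append(header + row[4:])
    return filled


for row in fill_rows(rows):
    print(row)
```

Files without seizures keep their None start and end values, so they can still be told apart from files that have seizures.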

CodePudding user response：

Using re, you can capture the optional iterations of the Seizure strings in a group, and then from that group capture the digit values for the seconds:

Pattern

File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\nNumber of Seizures in File: (.+)((?:\nSeizure (?:\d+ )?Start Time: \d+ seconds\nSeizure (?:\d+ )?End Time: \d+ seconds)*)

The pattern matches:

File Name: (.+)\n Group 1, match the rest of the line after File Name: and then a newline
File Start Time: (.+)\n Group 2, match the rest of the line after File Start Time: and then a newline
File End Time: (.+)\n Group 3, match the rest of the line after File End Time: and then a newline
Number of Seizures in File: (.+) Group 4, match the rest of the line after Number of Seizures in File:
( Group 5
- (?: Non capture group to match as a whole and then optionally repeat
  - \nSeizure (?:\d+ )?Start Time: \d+ seconds\n Match a newline, the Seizure Start Time line, and a newline at the end
  - Seizure (?:\d+ )?End Time: \d+ seconds Match the Seizure End Time line
- )* Close the non capture group and optionally repeat it
) Close group 5


For example

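import re

# summary is the metadata text shown in the question
pattern = re.compile(
    r"File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d+ )?Start Time: \d+ seconds\nSeizure (?:\d+ )?End Time: \d+ seconds)*)"
)

for m in pattern.finditer(summary):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
    # Group 5 holds all Seizure lines of the block; extract every seconds value
    print(re.findall(r"(\d+) seconds", m.group(5)))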

The output per match would look like this (the list is empty when there are no Seizure values, which you can test for as well):

chb23_08.edf
11:48:05
14:40:27
2
['325', '345', '5104', '5151']
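
If you would rather have the seconds as (start, end) pairs than as one flat list, the following sketch shows one way to do it with the same pattern. The function name `parse`, the dictionary keys, and the shortened `sample` string are assumptions made for illustration.

```python
import re

pattern = re.compile(
    r"File Name: (.+)\n"
    r"File Start Time: (.+)\n"
    r"File End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d+ )?Start Time: \d+ seconds"
    r"\nSeizure (?:\d+ )?End Time: \d+ seconds)*)"
)

# Two blocks taken from the question's data.
sample = (
    "File Name: chb23_07.edf\n"
    "File Start Time: 11:03:16\n"
    "File End Time: 11:45:56\n"
    "Number of Seizures in File: 0\n"
    "\n"
    "File Name: chb23_08.edf\n"
    "File Start Time: 11:48:05\n"
    "File End Time: 14:40:27\n"
    "Number of Seizures in File: 2\n"
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 1 End Time: 345 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
    "Seizure 2 End Time: 5151 seconds\n"
)


def parse(text):
    """Return one dict per file block, with seizures as (start, end) ints."""
    records = []
    for m in pattern.finditer(text):
        seconds = [int(s) for s in re.findall(r"(\d+) seconds", m.group(5))]
        # Start and end values alternate, so pair even and odd positions.
        pairs = list(zip(seconds[0::2], seconds[1::2]))
        records.append({
            "name": m.group(1),
            "start": m.group(2),
            "end": m.group(3),
            "count": int(m.group(4)),
            "seizures": pairs,
        })
    return records


for record in parse(sample):
    print(record["name"], record["seizures"])
```

Pairing by position relies on every Start Time line being followed by its End Time line, which the pattern itself enforces.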

CodePudding user response：

If you're using the regex module, I would suggest using repeated captures.

I've also added named groups for clarity:

import regex

pattern = regex.compile(
    r"File Name: (?P<name>.+)\n"
    r"File Start Time: (?P<start>.+)\n"
    r"File End Time: (?P<end>.+)\n"
    r"Number of Seizures in File: (?P<count>\d+)\n"
    r"(?:\n|(?:Seizure (?:\d+ )?Start Time: (?P<seizure_start>\d+) seconds\n"
    r"Seizure (?:\d+ )?End Time: (?P<seizure_end>\d+) seconds\n)*)"
)

summary = """File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds

File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0

File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds

File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
"""

for match in pattern.finditer(summary):
    print("Name:", match.group("name"))
    print("Seizure Count", match.group("count"))
    seizures = tuple(
        zip(match.captures("seizure_start"), match.captures("seizure_end")))
    for i, (start, end) in enumerate(seizures, start=1):
        print(f"Seizure #{i}: {start} -> {end}")

Prints:

Name: chb03_34.edf
Seizure Count 1
Seizure #1: 1982 -> 2029
Name: chb23_07.edf
Seizure Count 0
Name: chb23_08.edf
Seizure Count 2
Seizure #1: 325 -> 345
Seizure #2: 5104 -> 5151
Name: chb23_09.edf
Seizure Count 4
Seizure #1: 2589 -> 2660
Seizure #2: 6885 -> 6947
Seizure #3: 8505 -> 8532
Seizure #4: 9580 -> 9664