I have a formatted string that can contain a repeated part of arbitrary length. Here is an example of the metadata I want to parse:
File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds
File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0
File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds
File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
So far I've created a regex that captures the first four lines of each block, but when a block contains seizures it only captures the last one.
import re
summary = "a formatted string read"
pattern = r"File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\nNumber of Seizures in File: (.+)(?:\n|\r|)(?:Seizure(?: | \d+ )Start Time: (\d+) seconds\nSeizure(?: | \d+ )End Time: (\d+) seconds(?:\n|\r|))*"
pattern = re.compile(pattern)
for p in pattern.finditer(summary):
    print(p.groups())
But for the last block, for example, this pattern only captures the start and end time of seizure 4. Is it possible to capture a repeated subpattern recursively?
EDIT: Using the regex module and the pattern that The fourth bird posted in the comments, I can match the strings, but I get a lot of None values in the repeated rows, and also rows that are entirely None. How can I get rid of those, or insert the appropriate value?
('chb23_06.edf', '08:57:57', '11:02:43', '1', '3962', '4075')
(None, None, None, None, None, None)
('chb23_07.edf', '11:03:16', '11:45:56', '0', None, None)
(None, None, None, None, None, None)
('chb23_08.edf', '11:48:05', '14:40:27', '2', '325', '345')
(None, None, None, None, '5104', '5151')
(None, None, None, None, None, None)
('chb23_09.edf', '14:40:47', '18:41:13', '4', '2589', '2660')
(None, None, None, None, '6885', '6947')
(None, None, None, None, '8505', '8532')
(None, None, None, None, '9580', '9664')
(None, None, None, None, None, None)
('chb23_10.edf', '18:41:40', '22:41:40', '0', None, None)
(None, None, None, None, None, None)
('chb23_16.edf', '13:46:32', '17:46:32', '0', None, None)
(None, None, None, None, None, None)
('chb23_17.edf', '17:46:42', '21:16:29', '0', None, None)
(None, None, None, None, None, None)
('chb23_19.edf', '02:28:28', '6:28:28', '0', None, None)
(None, None, None, None, None, None)
('chb23_20.edf', '06:28:36', '7:52:05', '0', None, None)
(None, None, None, None, None, None)
CodePudding user response:
Using re, you can capture the optional repetitions of the Seizure lines in a single group, and then extract the digit values for the seconds from that group:
Pattern
File Name: (.+)\nFile Start Time: (.+)\nFile End Time: (.+)\nNumber of Seizures in File: (.+)((?:\nSeizure (?:\d+ )?Start Time: \d+ seconds\nSeizure (?:\d+ )?End Time: \d+ seconds)*)
The pattern matches:
File Name: (.+)\n
Group 1, match all after File Name: and a newline
File Start Time: (.+)\n
Group 2, match all after File Start Time: and a newline
File End Time: (.+)\n
Group 3, match all after File End Time: and a newline
Number of Seizures in File: (.+)
Group 4, match all after Number of Seizures in File:
(
Group 5
(?:
Non capture group to match as a whole and then optionally repeat
\nSeizure (?:\d+ )?Start Time: \d+ seconds\n
Match a newline and match the Seizure Start Time and a newline at the end
Seizure (?:\d+ )?End Time: \d+ seconds
Match the Seizure End Time
)*
Close the non capture group and optionally repeat it
)
Close group 5
For example
import re
pattern = re.compile(pattern)
for m in pattern.finditer(summary):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
    # Group 5 holds all Seizure lines of the block; pull out every seconds value
    print(re.findall(r"(\d+) seconds", m.group(5)))
The output for one match looks like this (the list is empty when there are no Seizure values, which you can test for as well):
chb23_08.edf
11:48:05
14:40:27
2
['325', '345', '5104', '5151']
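To go from the flat list of seconds to start/end pairs, consecutive values can be zipped together. The following is only a minimal sketch using the pattern above; the helper name parse_summary and the dictionary keys are made up for illustration, and the sample text is a shortened copy of the data from the question.

```python
import re

# Same pattern as above, split over several lines for readability.
PATTERN = re.compile(
    r"File Name: (.+)\n"
    r"File Start Time: (.+)\n"
    r"File End Time: (.+)\n"
    r"Number of Seizures in File: (.+)"
    r"((?:\nSeizure (?:\d+ )?Start Time: \d+ seconds"
    r"\nSeizure (?:\d+ )?End Time: \d+ seconds)*)"
)


def parse_summary(text):
    """Return one dict per file block (hypothetical helper for illustration)."""
    records = []
    for m in PATTERN.finditer(text):
        # Group 5 is the whole run of Seizure lines (empty when there are none).
        seconds = [int(s) for s in re.findall(r"(\d+) seconds", m.group(5))]
        # Pair consecutive values: [s1, e1, s2, e2] -> [(s1, e1), (s2, e2)]
        seizures = list(zip(seconds[0::2], seconds[1::2]))
        records.append({
            "name": m.group(1),
            "start": m.group(2),
            "end": m.group(3),
            "count": int(m.group(4)),
            "seizures": seizures,
        })
    return records


# Shortened sample taken from the question's data.
sample = (
    "File Name: chb23_07.edf\n"
    "File Start Time: 11:03:16\n"
    "File End Time: 11:45:56\n"
    "Number of Seizures in File: 0\n"
    "File Name: chb23_08.edf\n"
    "File Start Time: 11:48:05\n"
    "File End Time: 14:40:27\n"
    "Number of Seizures in File: 2\n"
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 1 End Time: 345 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
    "Seizure 2 End Time: 5151 seconds\n"
)

for record in parse_summary(sample):
    print(record["name"], record["count"], record["seizures"])
```

Because every block yields exactly one record, this also avoids the all-None rows mentioned in the question: a block without seizures simply gets an empty list.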
CodePudding user response:
If you're using the regex module, I would suggest using repeated captures.
I've also added named groups for clarity:
import regex
pattern = regex.compile(
    r"File Name: (?P<name>.+)\n"
    r"File Start Time: (?P<start>.+)\n"
    r"File End Time: (?P<end>.+)\n"
    r"Number of Seizures in File: (?P<count>\d+)\n"
    r"(?:\n|(?:Seizure (?:\d+ )?Start Time: (?P<seizure_start>\d+) seconds\n"
    r"Seizure (?:\d+ )?End Time: (?P<seizure_end>\d+) seconds\n)*)"
)
summary = """File Name: chb03_34.edf
File Start Time: 01:51:23
File End Time: 2:51:23
Number of Seizures in File: 1
Seizure Start Time: 1982 seconds
Seizure End Time: 2029 seconds
File Name: chb23_07.edf
File Start Time: 11:03:16
File End Time: 11:45:56
Number of Seizures in File: 0
File Name: chb23_08.edf
File Start Time: 11:48:05
File End Time: 14:40:27
Number of Seizures in File: 2
Seizure 1 Start Time: 325 seconds
Seizure 1 End Time: 345 seconds
Seizure 2 Start Time: 5104 seconds
Seizure 2 End Time: 5151 seconds
File Name: chb23_09.edf
File Start Time: 14:40:47
File End Time: 18:41:13
Number of Seizures in File: 4
Seizure 1 Start Time: 2589 seconds
Seizure 1 End Time: 2660 seconds
Seizure 2 Start Time: 6885 seconds
Seizure 2 End Time: 6947 seconds
Seizure 3 Start Time: 8505 seconds
Seizure 3 End Time: 8532 seconds
Seizure 4 Start Time: 9580 seconds
Seizure 4 End Time: 9664 seconds
"""
for match in pattern.finditer(summary):
    print("Name:", match.group("name"))
    print("Seizure Count", match.group("count"))
    # captures() returns every repetition of a group, not just the last one
    seizures = tuple(
        zip(match.captures("seizure_start"), match.captures("seizure_end")))
    for i, (start, end) in enumerate(seizures, start=1):
        print(f"Seizure #{i}: {start} -> {end}")
Prints:
Name: chb03_34.edf
Seizure Count 1
Seizure #1: 1982 -> 2029
Name: chb23_07.edf
Seizure Count 0
Name: chb23_08.edf
Seizure Count 2
Seizure #1: 325 -> 345
Seizure #2: 5104 -> 5151
Name: chb23_09.edf
Seizure Count 4
Seizure #1: 2589 -> 2660
Seizure #2: 6885 -> 6947
Seizure #3: 8505 -> 8532
Seizure #4: 9580 -> 9664
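For comparison, the standard re module keeps only the last repetition of a repeated capture group, which is why the pattern in the question returned only the final seizure. Here is a minimal sketch of that behaviour with re alone; the sample text is shortened from the question's data and the variable names are illustrative.

```python
import re

# One block fragment with two seizures, shortened from the question's data.
block = (
    "Number of Seizures in File: 2\n"
    "Seizure 1 Start Time: 325 seconds\n"
    "Seizure 1 End Time: 345 seconds\n"
    "Seizure 2 Start Time: 5104 seconds\n"
    "Seizure 2 End Time: 5151 seconds\n"
)

repeated = re.compile(
    r"Number of Seizures in File: (\d+)\n"
    r"(?:Seizure (?:\d+ )?Start Time: (\d+) seconds\n"
    r"Seizure (?:\d+ )?End Time: (\d+) seconds\n)*"
)

m = repeated.search(block)
# Groups 2 and 3 sit inside the repeated part, so each repetition
# overwrites the previous one and only the last seizure survives.
last_only = m.groups()
print(last_only)  # ('2', '5104', '5151')
```

With regex, match.captures() returns every repetition instead, which is what the code above relies on.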