Regex: Match multiple timestamps in a string

I have a text file in which every line starts with a timestamp and may contain further timestamps later in the line. The first timestamp is always enclosed in [], and the ones in the middle of the line are always enclosed in <>. The goal is to write a regex pattern that captures each timestamp together with the text that follows it. I'm pretty new to regex, and I'm having a hard time with it. The text looks like this:

[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit
[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore

However, the lines are being fed to regex one by one, so there's no need to account for other lines (although some sort of exception would be needed to match a newline at the end of the line or the end of the file...).

The desired output would be something like this (for each line):

[('00:22.88', 'Lorem '), ('11:53.82', 'ipsum dolor sit amet, consectetur '), ('98:23.52', 'adipiscing elit')]

This pattern, for instance, works for the very first timestamp:

\[(\d{2}:\d{2}.\d{2})\]\s*(.+)

For the rest, I don't know how to do it. I tried adding | between the square brackets and the angle brackets in an attempt to make it match "this or that", but it didn't work:

\[|<(\d{2}:\d{2}.\d{2})\]|>(.+)

I also tried this, in an attempt to match anything in between timestamps; it didn't work either:

\[(\d{2}:\d{2}.\d{2})\]\s*<([0-9]+:[0-9.]*)>\s*(.+)\s*

I would really appreciate it if somebody with more experience with regex could lend me a hand; I have no clue how to tackle this. I did find a pretty cool website for writing regex patterns, which was useful when trying to write my own: https://regexr.com/

CodePudding user response:

Going with pure regexp splitting, I'd use the following. The regexp matches < or [ followed by your number pattern, then > or ] to close the timestamp. For the content, it takes everything up to the next < or [.

import re

regex = r"(?:<|\[)([\d]{2}:[\d]{2}\.[\d]{2})(?:\]|>)([^<\[]+)"

test_str = ("[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
    "[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore")

matches = re.finditer(regex, test_str, re.MULTILINE)

found = []

for match in matches:
    # group 1 is the timestamp, group 2 is the text that follows it
    found.append((match.group(1).strip(), match.group(2).strip()))
    
print(found)

The above regexp can be visualized and debugged with the following link: https://regex101.com/r/Pyr2J4/1
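
Since the question says the lines arrive one at a time, the same pattern can also be applied per line. What follows is only a minimal sketch under that assumption: the names TIMESTAMP_TEXT and parse_line are made up for illustration, and the text is deliberately not stripped so that the result keeps the trailing spaces shown in the desired output.

```python
import re

# Same idea as the pattern above, compiled once.
# TIMESTAMP_TEXT is a name invented for this sketch.
TIMESTAMP_TEXT = re.compile(r"(?:<|\[)(\d{2}:\d{2}\.\d{2})(?:\]|>)([^<\[]+)")

def parse_line(line):
    """Hypothetical helper: return (timestamp, text) tuples for one line."""
    # Drop a trailing newline so it does not end up in the last text chunk.
    # With two capturing groups, findall returns one tuple per match.
    return TIMESTAMP_TEXT.findall(line.rstrip("\n"))

line = "[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
print(parse_line(line))
```

Because the text group requires at least one character, a timestamp that is immediately followed by another timestamp would be skipped by this sketch.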

The above regexp might be enough for you, but it truncates the content if the text itself contains a < or [ (e.g. "Lorem < ipsum ..."). If you'd like to be able to process those lines too, I suggest matching only the timestamps and then taking the text between the matches as the content. Note that the following regexp does not accept mixed delimiters like [00:00.00>, which the above one does. This takes a little bit more Python:

import re

regex = r"<[\d]{2}:[\d]{2}\.[\d]{2}>|\[[\d]{2}:[\d]{2}\.[\d]{2}\]"

test_str = ("[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
    "[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore")

matches = re.finditer(regex, test_str, re.MULTILINE)

found = []
last_match_end = None

for match in matches:
    if len(found) > 0 and last_match_end is not None:
        # add the text from the end of the last match to the start of the 
        # current match as the text of the last match (to previous list value)
        found[-1].append(test_str[last_match_end:match.start()].strip())
        
    # take the timestamp (=match) from the current match
    found.append([match.group().strip("<>[]")])
    # save end of this match
    last_match_end = match.end()
    
if len(found) > 0 and last_match_end is not None:
    # add missing text of last match
    found[-1].append(test_str[last_match_end:].strip())

print(found)
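
The same idea can be written more compactly for a single line, which is how the question says the input arrives. This is a sketch, not a drop-in replacement: split_line and TIMESTAMP are names invented here, it returns tuples instead of lists, and it leaves the text unstripped. The sample line contains a stray < and a stray [ to show that they stay in the text.

```python
import re

# Matches only complete timestamps: <mm:ss.xx> or [mm:ss.xx].
# TIMESTAMP and split_line are hypothetical names used for illustration.
TIMESTAMP = re.compile(r"<\d{2}:\d{2}\.\d{2}>|\[\d{2}:\d{2}\.\d{2}\]")

def split_line(line):
    """Return (timestamp, text) tuples, keeping stray < and [ in the text."""
    line = line.rstrip("\n")
    matches = list(TIMESTAMP.finditer(line))
    # Each text chunk ends where the next timestamp starts,
    # and the last chunk runs to the end of the line.
    ends = [m.start() for m in matches[1:]] + [len(line)]
    # m.group()[1:-1] drops the surrounding bracket characters.
    return [(m.group()[1:-1], line[m.end():end]) for m, end in zip(matches, ends)]

print(split_line("[00:22.88]a < b <11:53.82>c[0] done"))
```

A line without any timestamp yields an empty list, because zip stops as soon as the list of matches is exhausted.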