Capture all characters in single string between regex matches

Time:07-16

I have a log file with the following format:

00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR

00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR

00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR

I am trying to capture all of the text beneath the log data for each entry, so that I ultimately end up with a list like:

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']

I have written the following regular expression and tested that it properly matches the log data:

"\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d side:start top:\d\d% bottom:\d\d% sound:\d\d%"

But I am looking to capture all of the text between these matches, and I am not sure whether a regex is even the best way to do it, as opposed to iterating through the 123,378-line text file and skipping both blank lines and lines that match the expression above.

What is the most efficient way to return a list of the text after each log entry?

CodePudding user response:

You can use re.findall with a pattern that uses a negative lookahead to capture every line up to the next header:

^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)


import re

pattern = r"^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)"

s = ("00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
     "FOO BAR FOO FOO FOO BAR\n\n"
     "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
     "BAR BAR BAR' BAR. FOO.BAR\n\n"
     "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
     "BOO.BOO. FARFAR.FAR")

res = [x.strip() for x in re.findall(pattern, s, re.M)]
print(res)

Output

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']

Or, if the header format is that regular, shorten the pattern so it only checks the first timestamp and the ;; separator:

^\d\d:\d\d:\d\d\.\d{3} ;; .*((?:\n(?!\d\d:\d\d:\d\d\.\d{3} ;;).*)*)
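As a rough sketch of how the shorter pattern could be used, the header prefix can be kept in one variable and reused in both the match and the lookahead. The names HEADER, ENTRY and sample are only illustrative, and the dot before the milliseconds is escaped here so it matches a literal "." rather than any character:

```python
import re

# Header prefix "HH:MM:SS.mmm ;;" (the dot is escaped to match a literal ".").
HEADER = r"\d\d:\d\d:\d\d\.\d{3} ;;"

# Group 1 collects every following line that does not start a new header.
ENTRY = re.compile(rf"^{HEADER} .*((?:\n(?!{HEADER}).*)*)", re.M)

sample = (
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

# finditer avoids building an intermediate list of raw captures.
bodies = [m.group(1).strip() for m in ENTRY.finditer(sample)]
print(bodies)
```

Compiling once and building the pattern from a single HEADER string keeps the two copies of the timestamp expression from drifting apart.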


CodePudding user response:

Alternative solution, using itertools.groupby (the regex is then simpler):

import re
from itertools import groupby

text = """\
00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR

00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR

00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR"""

pat = re.compile(r"\d\d:\d\d:\d\d\.\d\d\d ;;")

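out = []
# Group consecutive lines by whether they are header lines; keep only the non-header runs.
for is_header, group in groupby(text.splitlines(), lambda line: bool(pat.match(line))):
    if not is_header:
        out.append("\n".join(group).strip())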

print(out)

Prints:

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']
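For a file of the size mentioned in the question (123,378 lines), the line-by-line iteration the asker considered is also workable and does not need the whole text in memory. The following is only a sketch under that assumption; iter_entries is a hypothetical helper name, and io.StringIO stands in for an open file here:

```python
import io
import re

# A header line starts with "HH:MM:SS.mmm ;;".
HEADER = re.compile(r"\d\d:\d\d:\d\d\.\d{3} ;;")

def iter_entries(lines):
    """Yield the text found under each header line.

    `lines` is any iterable of strings, such as an open text file.
    """
    body = None  # stays None until the first header has been seen
    for line in lines:
        if HEADER.match(line):
            if body is not None:
                yield "\n".join(body).strip()
            body = []  # start collecting text for the new entry
        elif body is not None:
            body.append(line.rstrip("\n"))
    if body is not None:
        # Flush the text that follows the last header.
        yield "\n".join(body).strip()

sample = io.StringIO(
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

entries = list(iter_entries(sample))
print(entries)
```

With a real log, an open file object can be passed in place of the StringIO. Note that, unlike the groupby version, this sketch yields an empty string for a header that has no text under it, and ignores any text that appears before the first header.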