I have a log file with the following format:
00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR
00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR
00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR
What I am trying to do is capture the text beneath the log data for each entry, so that I end up with a list like:
['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']
I have written the following regular expression and tested that it properly matches the log data:
"\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d side:start top:\d\d% bottom:\d\d% sound:\d\d%"
What I want, though, is the text between these matches, and I am not certain a regex is even the best way to get it, as opposed to iterating through the 123,378-line text file and skipping both blank lines and lines that match the above expression.
What is the most efficient way to return a list of the text after each log entry?
CodePudding user response:
You can use re.findall with a pattern using a lookahead:
^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)
import re
# The dots in the timestamps are escaped so they match a literal "." only.
pattern = r"^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)"
s = ("00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
"FOO BAR FOO FOO FOO BAR\n\n"
"00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
"BAR BAR BAR' BAR. FOO.BAR\n\n"
"00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
"BOO.BOO. FARFAR.FAR")
res = [x.strip() for x in re.findall(pattern, s, re.M)]
print(res)
Output
['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']
Or if the data is that specific, shorten it to:
^\d\d:\d\d:\d\d\.\d{3} ;; .*((?:\n(?!\d\d:\d\d:\d\d\.\d{3} ;;).*)*)
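As a minimal sketch, the shortened pattern can be tried against the same sample data. Each timestamp dot is written as `\.` so that it matches a literal dot, and the variable names are only illustrative:

```python
import re

# Shortened pattern: one timestamp plus " ;;" is enough to identify a log line.
# The capture group collects every following line that does not start a new entry.
pattern = r"^\d\d:\d\d:\d\d\.\d{3} ;; .*((?:\n(?!\d\d:\d\d:\d\d\.\d{3} ;;).*)*)"

s = ("00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
     "FOO BAR FOO FOO FOO BAR\n\n"
     "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
     "BAR BAR BAR' BAR. FOO.BAR\n\n"
     "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
     "BOO.BOO. FARFAR.FAR")

# re.M lets ^ match at the start of every line; strip() drops the
# surrounding newlines that the group picks up.
res = [x.strip() for x in re.findall(pattern, s, re.M)]
print(res)
```

This should give the same list as the longer pattern, as long as no text line itself begins with a timestamp followed by ` ;;`.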
CodePudding user response:
Alternative solution, using itertools.groupby
(the regex is then simpler):
import re
from itertools import groupby
text = """\
00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR
00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR
00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR"""
pat = re.compile(r"\d\d:\d\d:\d\d\.\d\d\d ;;")
out = []
for v, g in groupby(text.splitlines(), lambda l: pat.match(l)):
    if not v:
        out.append("\n".join(g).strip())
print(out)
Prints:
['FOO BAR FOO FOO FOO BAR',
"BAR BAR BAR' BAR. FOO.BAR",
'BOO.BOO. FARFAR.FAR']
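For a file with that many lines, the same grouping idea can probably be applied to the file object directly, so the whole text does not have to be held in one string. The following is only a sketch under assumptions: `iter_entries` is a hypothetical helper name, and `io.StringIO` stands in for the real file, which you would open yourself.

```python
import io
import re
from itertools import groupby

pat = re.compile(r"\d\d:\d\d:\d\d\.\d\d\d ;;")

def iter_entries(lines):
    """Yield the text found between timestamp lines (hypothetical helper)."""
    # groupby splits the stream into alternating runs of header and non-header lines.
    for is_header, group in groupby(lines, lambda l: bool(pat.match(l))):
        if not is_header:
            # Lines read from a file keep their trailing "\n", so join with "".
            block = "".join(group).strip()
            if block:  # skip runs that contain only blank lines
                yield block

# io.StringIO is a stand-in for an opened log file; any iterable of lines works.
sample = io.StringIO(
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n"
    "\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n"
    "\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

entries = list(iter_entries(sample))
print(entries)
```

With a real file, something like `with open(path) as f: entries = list(iter_entries(f))` would read it line by line, where `path` is whatever your log file is called.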