Given an HLS media playlist as follows:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts
I want to create a regular expression to match the datetime following the EXT-X-PROGRAM-DATE-TIME tag closest to a specified .ts file name. For example, I want to be able to retrieve the datetime 2022-09-12T10:03:29.637+02:00 by specifying that the match should end with seg2.ts. It should work even if new tags are added in between the file name and the EXT-X-PROGRAM-DATE-TIME tag in the future.
This pattern (EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2.ts) is my best effort so far, but I can't figure out how to make the match start at the last possible EXT-X-PROGRAM-DATE-TIME tag. The lazy quantifier did not help. The group that is currently captured is the datetime following the first EXT-X-PROGRAM-DATE-TIME tag, i.e. 2022-09-12T10:03:22.621+02:00.
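The behaviour described above can be reproduced in a few lines. A regex engine reports the leftmost match, so a lazy quantifier only shortens where the match ends and never moves where it starts. The sketch below assumes Python's re module and a shortened copy of the playlist; the variable names are illustrative, and the dot in the file name is escaped.

```python
import re

# Shortened copy of the playlist, for illustration only.
playlist = (
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg1.ts\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg2.ts\n"
)

# The pattern from the question, with the dot in "seg2.ts" escaped.
lazy = re.search(r"EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2\.ts", playlist)

# The match starts at the first tag (index 1, right after the leading "#"),
# so group 1 holds the first datetime rather than the one next to seg2.ts.
first_datetime = lazy.group(1)
match_start = lazy.start()
print(first_datetime)
print(match_start)
```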
I also looked at using a negative lookahead, but I can't figure out how to combine that with matching a variable number of characters and whitespace before seg2.ts.
I'm sure this has been answered before in another context, but I just can't find the right search terms.
CodePudding user response:
We can use re.search here along with a regex tempered dot trick:
#Python 2.7.17
import re
inp = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts"""
# re.S lets "." match newlines, so the tempered dot can span several lines.
match = re.search(r'#EXT-X-PROGRAM-DATE-TIME:(\S+)(?:(?!EXT-X-PROGRAM-DATE-TIME).)*\bseg2\.ts', inp, flags=re.S)
if match:
    print(match.group(1)) # 2022-09-12T10:03:29.637+02:00
Here is an explanation of the regex pattern:
#EXT-X-PROGRAM-DATE-TIME: match the tag literally
(\S+) match and capture the timestamp
(?:(?!EXT-X-PROGRAM-DATE-TIME).)* match all content WITHOUT crossing the next section
\bseg2\.ts match the filename
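To look up any segment rather than a hard-coded seg2.ts, the same tempered dot pattern can be built from a parameter. This is a minimal sketch: the function name find_program_date_time and the shortened playlist are hypothetical, and it assumes the segment name starts with a word character so that \b still applies.

```python
import re


def find_program_date_time(playlist, segment_name):
    """Return the datetime of the closest preceding tag, or None (hypothetical helper)."""
    pattern = (
        r"#EXT-X-PROGRAM-DATE-TIME:(\S+)"      # capture the timestamp
        r"(?:(?!EXT-X-PROGRAM-DATE-TIME).)*"   # do not cross another tag
        r"\b" + re.escape(segment_name)        # escape dots in the file name
    )
    match = re.search(pattern, playlist, flags=re.S)
    return match.group(1) if match else None


# Shortened copy of the playlist, for illustration only.
playlist = (
    "#EXTM3U\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg1.ts\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg2.ts\n"
)

for_seg1 = find_program_date_time(playlist, "seg1.ts")
for_seg2 = find_program_date_time(playlist, "seg2.ts")
for_missing = find_program_date_time(playlist, "seg9.ts")
print(for_seg1, for_seg2, for_missing)
```

Using re.escape keeps the dot in the file name from matching arbitrary characters.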
CodePudding user response:
You can write the pattern so that it does not cross lines that match the seg file name pattern, and then match seg2.ts:
^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$
^ Start of a line (with re.M)
#EXT-X-PROGRAM-DATE-TIME: Match literally
(.*) Capture group 1, match the rest of the line (note that this can also match an empty string)
(?:\n(?!seg\d+\.ts$).*)* Match all following lines that are not a seg file name
\nseg2\.ts Match a newline and seg2.ts
$ End of the line (with re.M)
import re
pattern = r"^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$"
s = ("#EXTM3U\n"
     "#EXT-X-VERSION:3\n"
     "#EXT-X-ALLOW-CACHE:NO\n"
     "#EXT-X-TARGETDURATION:7\n"
     "#EXT-X-MEDIA-SEQUENCE:0\n\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
     "#EXTINF:6.666666667,\n"
     "seg1.ts\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
     "#EXTINF:6.666666667,\n"
     "seg2.ts\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00\n"
     "#EXTINF:6.666666666,\n"
     "seg3.ts")
# re.M makes ^ and $ match at the start and end of each line.
m = re.search(pattern, s, re.M)
if m:
    print(m.group(1))
Output
2022-09-12T10:03:29.637+02:00
If you also do not want to cross another #EXT-X-PROGRAM-DATE-TIME line in between, you can add that as an alternative to the negative lookahead:
^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*\nseg2\.ts$
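As a quick check that this variant still works when a new tag appears between the datetime and the file name, here is a minimal sketch. The shortened playlist and the inserted #EXT-X-DISCONTINUITY line are assumptions made for illustration; they are not part of the original playlist.

```python
import re

pattern = (
    r"^#EXT-X-PROGRAM-DATE-TIME:(.*)"
    r"(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*"
    r"\nseg2\.ts$"
)

# Shortened playlist with one extra tag added before seg2.ts (illustrative).
playlist = (
    "#EXTM3U\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg1.ts\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
    "#EXT-X-DISCONTINUITY\n"
    "#EXTINF:6.666666667,\n"
    "seg2.ts\n"
)

# re.M is needed so that ^ and $ anchor at line boundaries.
m = re.search(pattern, playlist, re.M)
result = m.group(1) if m else None
print(result)
```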