Regular expression to match closest tag above specific word (HLS media playlist)

Given an HLS media playlist as follows:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0

#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts

I want to create a regular expression to match the datetime following the EXT-X-PROGRAM-DATE-TIME tag closest above a specified .ts file name. For example, I want to be able to retrieve the datetime 2022-09-12T10:03:29.637+02:00 by specifying that the match should end with seg2.ts. It should work even if new tags are added between the file name and the EXT-X-PROGRAM-DATE-TIME tag in the future.

This pattern (EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2.ts) is my best effort so far, but I can't figure out how to make the match start at the last possible EXT-X-PROGRAM-DATE-TIME tag. The lazy quantifier did not help. The group that is currently captured is the datetime following the first EXT-X-PROGRAM-DATE-TIME tag, i.e. 2022-09-12T10:03:22.621+02:00.
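
A minimal reproduction of this behaviour, sketched in Python with a shortened playlist (the variable names are illustrative only). The lazy quantifier shortens where the match ends, but the engine still starts at the leftmost tag from which a match is possible:

```python
import re

# Shortened playlist: two segments are enough to show the problem.
playlist = "\n".join([
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00",
    "#EXTINF:6.666666667,",
    "seg1.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00",
    "#EXTINF:6.666666667,",
    "seg2.ts",
])

# The lazy [\s\S]*? only affects how far the match extends to the right.
# The search still begins at the first tag, so group 1 belongs to seg1.ts.
lazy = re.search(r"EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2.ts", playlist)
first_capture = lazy.group(1)
print(first_capture)  # 2022-09-12T10:03:22.621+02:00
```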

I also looked at using a negative lookahead, but I can't figure out how to combine that with matching a variable number of characters and whitespace before seg2.ts.

I'm sure this has been answered before in another context, but I just can't find the right search terms.

CodePudding user response：

We can use re.search here along with a regex tempered dot trick:

# Python 2.7.17

import re

inp = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0

#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts"""

# re.S lets "." match newlines, so the tempered dot can span several lines.
match = re.search(r'#EXT-X-PROGRAM-DATE-TIME:(\S+)(?:(?!EXT-X-PROGRAM-DATE-TIME).)*\bseg2\.ts', inp, flags=re.S)
if match:
    print(match.group(1))  # 2022-09-12T10:03:29.637+02:00

Here is an explanation of the regex pattern:

#EXT-X-PROGRAM-DATE-TIME: match the tag literally
(\S+) match and capture the timestamp
(?:(?!EXT-X-PROGRAM-DATE-TIME).)* match all content WITHOUT crossing into the next section
\bseg2\.ts match the filename
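
Since the file name is meant to be specified by the caller, the same tempered dot pattern could be wrapped in a small helper. This is only a sketch: the function name program_date_time_for and its parameters are made up for illustration, and re.escape is used so that the dot in the file name is matched literally.

```python
import re


def program_date_time_for(playlist, segment):
    """Return the timestamp of the closest tag above `segment`, or None."""
    pattern = (
        r"#EXT-X-PROGRAM-DATE-TIME:(\S+)"      # capture the timestamp
        r"(?:(?!EXT-X-PROGRAM-DATE-TIME).)*"   # never cross another tag
        r"\b" + re.escape(segment)             # the requested file name
    )
    match = re.search(pattern, playlist, flags=re.S)
    return match.group(1) if match else None


playlist = """#EXTM3U
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts"""

result = program_date_time_for(playlist, "seg2.ts")
print(result)  # 2022-09-12T10:03:29.637+02:00
```

A segment name that does not occur in the playlist makes the helper return None instead of raising.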

CodePudding user response：

You could write a pattern that does not cross lines matching the seg file-name pattern, and then matches seg2.ts:

^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$

^ Start of a line (the pattern is used with re.M)
#EXT-X-PROGRAM-DATE-TIME: Match literally
(.*) Capture group 1, match the rest of the line (note that this can also match an empty string)
(?:\n(?!seg\d+\.ts$).*)* Match all following lines that are not a seg file name
\nseg2\.ts Match a newline and seg2.ts
$ End of the line

import re

pattern = r"^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$"

s = ("#EXTM3U\n"
     "#EXT-X-VERSION:3\n"
     "#EXT-X-ALLOW-CACHE:NO\n"
     "#EXT-X-TARGETDURATION:7\n"
     "#EXT-X-MEDIA-SEQUENCE:0\n\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
     "#EXTINF:6.666666667,\n"
     "seg1.ts\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
     "#EXTINF:6.666666667,\n"
     "seg2.ts\n"
     "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00\n"
     "#EXTINF:6.666666666,\n"
     "seg3.ts")

# re.M makes ^ and $ match at the start and end of every line.
m = re.search(pattern, s, re.M)
if m:
    print(m.group(1))

Output

2022-09-12T10:03:29.637+02:00

If you also do not want the match to cross another #EXT-X-PROGRAM-DATE-TIME line in between, you can add that tag as an alternative inside the negative lookahead:

^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*\nseg2\.ts$
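
As a quick check of the requirement that extra tags between the timestamp and the file name should not break the match, the pattern above can be run against a playlist with one more line inserted. This is a sketch only: #EXT-X-FUTURE-TAG is a made-up tag used purely for illustration, and the playlist is shortened.

```python
import re

pattern = re.compile(
    r"^#EXT-X-PROGRAM-DATE-TIME:(.*)"
    r"(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*"
    r"\nseg2\.ts$",
    re.M,  # ^ and $ match at every line boundary
)

playlist = (
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
    "#EXTINF:6.666666667,\n"
    "seg1.ts\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
    "#EXT-X-FUTURE-TAG:1\n"  # hypothetical extra tag, not a real HLS tag
    "#EXTINF:6.666666667,\n"
    "seg2.ts\n"
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00\n"
    "#EXTINF:6.666666666,\n"
    "seg3.ts"
)

m = pattern.search(playlist)
result = m.group(1) if m else None
print(result)  # 2022-09-12T10:03:29.637+02:00
```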