How to get closest regex pattern based on a word index JS

Time:11-27

This is my string:

"1
00:01:46,356 --> 00:01:49,893
What is this? It's blue.

2
00:01:50,794 --> 00:01:54,998
We used a different chemical process,
but it is every bit as pure.

3
00:01:55,199 --> 00:01:58,267
It may be blue, but it's the bomb."

I have the index of the word "chemical", which is 107. I need the closest time pattern before it, and I tried to use a regex. Here is my code:

const st = "..."; // placeholder for the subtitle text above

var re = /[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]/;

const match = st.match(re);

How can I search backwards from my index to reach the closest time pattern before it? The return value here should be: 00:01:50,794 --> 00:01:54,998

CodePudding user response:

Assuming the format is consistent, this can be done without regex using split, find and includes:

const s = `1
00:01:46,356 --> 00:01:49,893
What is this? It's blue.

2
00:01:50,794 --> 00:01:54,998
We used a different chemical process,
but it is every bit as pure.

3
00:01:55,199 --> 00:01:58,267
It may be blue, but it's the bomb.`;

console.log(s.split("\n\n").find(e => e.includes("chemical")).split("\n")[1]);
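
If you would rather start from the character index, as in the question, one option is to look only at the text before that index and keep the last timing line found there. This is a minimal sketch and not part of the answer above: timingLineBefore is a hypothetical helper name, and it assumes every timing line has the form HH:MM:SS,mmm --> HH:MM:SS,mmm.

```javascript
// Sample subtitle text from the question, joined with "\n".
const subtitles = [
  "1",
  "00:01:46,356 --> 00:01:49,893",
  "What is this? It's blue.",
  "",
  "2",
  "00:01:50,794 --> 00:01:54,998",
  "We used a different chemical process,",
  "but it is every bit as pure.",
  "",
  "3",
  "00:01:55,199 --> 00:01:58,267",
  "It may be blue, but it's the bomb.",
].join("\n");

// Hypothetical helper: returns the last complete timing line found
// before `index`, or null when there is none.
function timingLineBefore(text, index) {
  const timing = /\d\d:\d\d:\d\d,\d{3} --> \d\d:\d\d:\d\d,\d{3}/g;
  let last = null;
  // Only the text in front of the index is searched.
  for (const m of text.slice(0, index).matchAll(timing)) {
    last = m[0];
  }
  return last;
}

// The index is computed here instead of being hard-coded.
const wordIndex = subtitles.indexOf("chemical");
console.log(timingLineBefore(subtitles, wordIndex));
```

With the sample text this logs 00:01:50,794 --> 00:01:54,998. Slicing first keeps the regex simple, at the cost of scanning everything before the index.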

CodePudding user response:

You can just use a positive lookahead to assert that what you're matching is followed, within the same subtitle block, by the word "chemical".

A positive lookahead is denoted by a group beginning with ?=.

Next, we need a pattern to put in the lookahead. It will need to look for the word "chemical" and match everything up to and including it. We can do this fairly simply by asserting the end of the line ($, which needs the multiline flag m to match at every line end), then looking for the next line break (\n) and matching any character after it (.) as many times as possible (+). The resulting pattern would look something like $\n.+

I'm assuming it's possible that this word could be on the second or even third line of text, so we will need to wrap this in a group that can match multiple times. A non-capturing group lets us do that without creating an extra capture, and is denoted by a group beginning with ?:. Adding this to our existing pattern and allowing it to match multiple times by adding a quantifier would look something like (?:$\n.+)+

Now the last part of this pattern is to match the actual word, which is easy! We can just put the word itself after this pattern like so: (?:$\n.+)+chemical. Because . never matches a line break, the blank line between subtitle blocks stops the repetition, so the search stays inside one block.

The final step is to wrap it in the positive lookahead we talked about at the very beginning. It would look something like this: (?=(?:$\n.+)+chemical)
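
To see what the lookahead accepts on its own, here is a small sketch. It assumes the quantifiers in the pattern are + and that the m flag is set. The leading ^.+$ is added only for this demonstration, to pick out whole lines to test the lookahead against.

```javascript
// Sample subtitle text from the question, joined with "\n".
const subtitles = [
  "1",
  "00:01:46,356 --> 00:01:49,893",
  "What is this? It's blue.",
  "",
  "2",
  "00:01:50,794 --> 00:01:54,998",
  "We used a different chemical process,",
  "but it is every bit as pure.",
  "",
  "3",
  "00:01:55,199 --> 00:01:58,267",
  "It may be blue, but it's the bomb.",
].join("\n");

// ^.+$ selects one whole line (demo only); the lookahead then requires
// one or more following non-empty lines that lead up to "chemical".
const followedByWord = /^.+$(?=(?:$\n.+)+chemical)/gm;

// With the g flag, match() returns every whole-line match.
const linesAboveWord = subtitles.match(followedByWord);
console.log(linesAboveWord);
```

This logs the two lines that sit above the word in its block: "2" and "00:01:50,794 --> 00:01:54,998". The lookahead alone does not single out the timestamp, which is why a timestamp pattern still has to go in front of it.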

Alright, now after that wall of text, we can finally create the pattern that actually matches the timestamps. What you have works for one of them, but based on your question it looks like you want to match both timestamps and the arrow in between, so I will create the pattern to do that.

Let's start with the one you provided: [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]. Before we start on making it match both timestamps, let's simplify it a bit. This: \d\d:\d\d:\d\d,\d\d\d will do the same thing as the pattern you provided, and this too can actually be simplified a tiny bit. Notice that \d\d: is in there twice. We can make this a group and tell it to match twice like so: (\d\d:){2}. The shortened pattern will look like (\d\d:){2}\d\d,\d\d\d. You can also "simplify" the last three \ds to \d{3}, but that only saves one character so I won't bother; do it if you would like.
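
A quick way to convince yourself that the three spellings are interchangeable is to run them against the same line. This is only an illustrative check on one sample line, not a proof; the variable names are made up for the example.

```javascript
// One timing line from the question.
const line = "00:01:46,356 --> 00:01:49,893";

// The pattern from the question and the two shorter spellings.
const original = /[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]/g;
const shorthand = /\d\d:\d\d:\d\d,\d\d\d/g;
const grouped = /(\d\d:){2}\d\d,\d\d\d/g;

// With the g flag, match() returns only the full matches,
// so the capture group in `grouped` does not change the result.
const results = [original, shorthand, grouped].map((re) => line.match(re));
console.log(results);
```

Each of the three entries is the same pair of timestamps, "00:01:46,356" and "00:01:49,893".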

Now that that's simplified a bit, we can start adding to it. Since both timestamps are the same format, we can just use a quantifier to tell it to match multiple times like so: ((\d\d:){2}\d\d,\d\d\d)+. Now all we need to do is account for the arrow in the middle. We can use | to add the arrow sequence ( --> , including the spaces around it) as an alternate match for the expression. All together, it would look like this: ((\d\d:){2}\d\d,\d\d\d| --> )+
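
Here is the timestamp-and-arrow pattern on its own, before the lookahead is attached. It is a sketch that assumes the trailing quantifier is +, so that the group can repeat across timestamp, arrow and timestamp.

```javascript
// One subtitle block from the question.
const block = [
  "2",
  "00:01:50,794 --> 00:01:54,998",
  "We used a different chemical process,",
].join("\n");

// A run of timestamps and arrows, in any order, repeated one or more times.
const timingRun = /((\d\d:){2}\d\d,\d\d\d| --> )+/;

const found = block.match(timingRun);
console.log(found[0]);
```

This logs the whole timing line, 00:01:50,794 --> 00:01:54,998. Note that the alternation is permissive: it would also accept a lone arrow or a single timestamp. If that matters for your data, spelling out timestamp, arrow, timestamp in sequence is stricter.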

Now finally, we can put the two parts together into a single regular expression: ((\d\d:){2}\d\d,\d\d\d| --> )+(?=(?:$\n.+)+chemical). It needs the m (multiline) flag so that $ matches at the end of each line. You can paste the expression into Regex101 to play with it.
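
Put together in JavaScript, the combined expression could be used like this. The sketch assumes the quantifiers are + and sets the m flag so that $ matches at line ends.

```javascript
// Sample subtitle text from the question, joined with "\n".
const subtitles = [
  "1",
  "00:01:46,356 --> 00:01:49,893",
  "What is this? It's blue.",
  "",
  "2",
  "00:01:50,794 --> 00:01:54,998",
  "We used a different chemical process,",
  "but it is every bit as pure.",
  "",
  "3",
  "00:01:55,199 --> 00:01:58,267",
  "It may be blue, but it's the bomb.",
].join("\n");

// Timestamps and arrow, followed (in the same block) by "chemical".
const re = /((\d\d:){2}\d\d,\d\d\d| --> )+(?=(?:$\n.+)+chemical)/m;

// match() returns null when no block contains the word.
const match = subtitles.match(re);
const timingLine = match ? match[0] : null;
console.log(timingLine);
```

With the sample text this logs 00:01:50,794 --> 00:01:54,998. The timing line of the first block is rejected because the blank line after it stops the lookahead before it can reach the word.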

I apologize for the long-winded answer, but I hope it helps you learn a little bit more about regular expressions!
