Checking if before the link there is a specific text (regex)

Time:03-06

I am trying to extract all the links on a web page that are preceded by "Volume", "Volume 1", or "Volume 1:". The code I have now (see below) returns every link, including pictures, emojis, and other unrelated content.

Note: right now it just selects the links and ignores tags, but to check for "volume" or similar I would also need to account for the tags (e.g. volume 1 <a href='link'>).

Pages you can use to test: 0, 1, 2

Currently, I have this code:

const urlRegex = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*)/g;

document.querySelector(".inner").outerHTML.match(urlRegex);

It selects the .inner element, serializes its HTML to a string, and then pulls every URL out of that string. The result also includes pictures and other content I don't want; I only need the actual data (the volumes).

If you are confused about what I want, then for example, we have this:

<br>volume 1 <a ... /a><br>
<br>image <a ... /a><br>

I want to get only the volume 1 link. Is there any way to filter out the rest?

CodePudding user response:

You need to put your desired match between a positive lookbehind and a positive lookahead:

let html = `<br>volume 1 <a href="https://www.google.com" /a><br>\n<br>image <a href="https://www.facebook.com" /a><br>`
let links = html.match(/(?<=volume.*?href=\").*?(?=\")/ig);
console.log(links);

Expression explained:

  • (?<=...) is a positive lookbehind. It asserts that what follows it is preceded by what goes inside it (the ..., which is volume.*?href=\" in the above expression).
  • volume matches the word "volume" literally. Note that all matches here are case-insensitive due to the i flag at the end.
  • .*? matches any character zero or more times, without being greedy. Thus it would match any character until it reaches the next expression.
  • href=\" matches href=" literally.
  • .*? again, matches any character between zero and infinite number of times non-greedily.
  • (?=\") is a positive lookahead. It asserts that what comes before it is followed by ".

You can find an even better explanation here: https://regex101.com/r/SOB1Gi/1.

In short, this expression matches the href value of any link that appears after the word volume on the same line (without the s flag, . does not match line breaks).
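One thing to watch for: if the page's HTML has no line breaks between entries, the .*? inside the lookbehind can stretch from "volume" across other tags and pick up later links too. A possible tightening, sketched below under the assumption that the word sits directly before the <a> tag with no other tag in between, replaces .*? with negated character classes. The sample markup, URLs, and variable names are made up for illustration.

```javascript
// Made-up single-line markup: a volume link followed by an image link.
const oneLine =
  '<br>volume 1 <a href="https://example.com/v1">v1</a>' +
  '<br>image <a href="https://example.com/pic.png">pic</a><br>';

// Pattern from the answer: "volume" may be anywhere earlier on the line.
const looseLinks = oneLine.match(/(?<=volume.*?href=").*?(?=")/gi);

// Tighter sketch: [^<>]* keeps "volume" and "<a" free of intervening tags,
// and [^>]* keeps href inside that same <a ...> tag.
const strictLinks = oneLine.match(/(?<=volume[^<>]*<a\s[^>]*href=")[^"]*/gi);

console.log(looseLinks);
// [ 'https://example.com/v1', 'https://example.com/pic.png' ]
console.log(strictLinks);
// [ 'https://example.com/v1' ]
```

Both patterns rely on variable-length lookbehind, which current browsers and Node.js support but some older engines do not.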
