Extracting and replacing html link tag with regex

Time:11-25

I am trying to do some HTML scraping with JavaScript, and would like to take an a href link and turn it into a hyperlink in a Discord embed. I am having trouble with regex, which I am finding very difficult to learn. I assume I will also need another regex to capture the whole tag so I can replace it with my desired output?

This is an example raw html that I have:

An **example**, also known as a <a href="https://www.example.com/example type">example type</a>

To make this readable within a Discord embed, I am looking for this output:

An **example**, also known as a [**example type**](https://www.example.com/example type)

I have tried extracting the URL via regex, and I can match it. However, I am having trouble extracting both the URL and the link text (the 'example type' in the example; I think it is called the target?) and then replacing the tag with my desired output. I have the following (https://regexr.com/73574):

/href="[^"]+/g

This matches href="https://www.example.com/example type, which feels like a very early step: it includes 'href' in the match, and it does not capture the target (the link text).

EDIT: I apologise, I did not think about additional cases. What if the string has multiple links, with text after them? For example:

An **example**, also known as a <a href="https://www.example.com/example type">example type</a> is the first example, and now I have <a href="https://www.example.com/second">second</a> example

with a desired output of:

An **example**, also known as a [**example type**](https://www.example.com/example type) is the first example, and now I have [**second**](https://www.example.com/second) example

CodePudding user response:

Try this: (?<=href=")[^"]*

By using a lookbehind, you can verify that the text immediately before the match is href=" without including it in the match.

Demo: https://regex101.com/r/2qMnPt/1
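As a minimal sketch of how that lookbehind could be used from JavaScript, assuming a runtime that supports lookbehind assertions (recent Node.js and current browsers do), the global flag plus String.prototype.match returns every href value. The variable names here are illustrative only. Note that this yields the URLs alone, not the link text.

```javascript
// The two-link string from the question's EDIT.
const html =
  'An **example**, also known as a <a href="https://www.example.com/example type">example type</a>' +
  ' is the first example, and now I have <a href="https://www.example.com/second">second</a> example';

// (?<=href=") asserts that href=" sits right before the match without consuming it;
// [^"]* then takes everything up to the closing quote.
const hrefRegex = /(?<=href=")[^"]*/g;

// With the g flag, match() returns an array of all matches, or null if there are none.
const urls = html.match(hrefRegex) ?? [];

console.log(urls);
// [ 'https://www.example.com/example type', 'https://www.example.com/second' ]
```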

CodePudding user response:

You can use regular expression groups to capture things that interest you. My regular expression here might be far from perfect but I don't think that's important here - it shows you a way and you can always improve it if needed.

Things you have to do:

  • prepare regex that captures groups that you need (anchor tag, anchor text, anchor url),
  • remove the anchor tag completely from the text
  • inject anchor text and anchor href into the final string

Here's a quick code example of that:

const anchorRegex = /(<a\shref="([^"]+)">(.+?)<\/a>)/i;
const textToBeParsed = `An **example**, also known as a <a href="https://www.example.com/example type">example type</a>`;

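const parseText = (text) => {
  // Run the regex against the argument, not the outer variable.
  const matches = anchorRegex.exec(text);

  if (!matches) {
    console.warn("Something went wrong...");

    return;
  }

  // Group 1 is the whole anchor tag, group 2 the href value, group 3 the link text.
  const [, fullAnchorTag, anchorUrl, anchorText] = matches;
  const textWithoutAnchorTag = text.replace(fullAnchorTag, '');

  // The link is appended at the end, so this only gives the desired output
  // when the anchor is the last thing in the text.
  return `${textWithoutAnchorTag}[**${anchorText}**](${anchorUrl})`;
};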

console.log(parseText(textToBeParsed));
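For the multi-link case from the question's EDIT, one possible variation is to replace each anchor in place with a global regex and a replacer callback. This is a sketch rather than the answerer's method: the helper name anchorsToMarkdown is hypothetical, and it assumes every anchor looks like the ones in the question (href is the only attribute, double-quoted, with no nested tags).

```javascript
// Hypothetical helper: turns every <a href="...">text</a> into [**text**](url) in place.
const anchorsToMarkdown = (text) =>
  text.replace(
    // Group 1 = href value, group 2 = link text; g replaces all anchors, i ignores case.
    /<a\s+href="([^"]+)">(.+?)<\/a>/gi,
    (_match, url, label) => `[**${label}**](${url})`
  );

const input =
  'An **example**, also known as a <a href="https://www.example.com/example type">example type</a>' +
  ' is the first example, and now I have <a href="https://www.example.com/second">second</a> example';

console.log(anchorsToMarkdown(input));
// An **example**, also known as a [**example type**](https://www.example.com/example type) is the first example, and now I have [**second**](https://www.example.com/second) example
```

Because the replacement happens where each tag was found, text after a link stays in its original position. For arbitrary real-world HTML, a proper HTML parser would likely be more robust than a regex.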
