Get text from <a> tags in text using JavaScript

I'm getting HTML content from an API.

A sample message could look like this:

Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>

I need my message to look like the one below: plain text, with each link replaced by its text.

Lorem ipsum dolor sit amet example.com
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor google.com

I've tried using a regex with groups; the JS code is below:

const r = /^<a href.*>(.*?)<\/a>$/gm

let link = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`

let result

while((result = r.exec(link)) !== null) {
  const match = result[1];
  link = link.replace(r, match)
}

console.log(link)

I also tried simple code like below

const r = /^<a href.*>(.*?)<\/a>$/gm

let link = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`

link = link.replaceAll(r, "$1")

console.log(link)

Unfortunately, in both cases console.log prints "test.com" instead of the whole message.

Are there any better solutions?

CodePudding user response：

You do not need to do it with a regular expression. You can use DOM to remove the links and any other HTML tags.

const htmlString = `Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`

const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, "text/html");

const text = doc.body.textContent;
console.log(text);

If you just want to remove the links and keep the other HTML tags, that is also possible.

const htmlString = `Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque <b>porta</b> ligula <em>et justo</em> condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`

const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, "text/html");

const anchors = doc.body.querySelectorAll("a");
anchors.forEach(node => node.replaceWith(...node.childNodes));

const htmlWithAnchorsRemoved = doc.body.innerHTML;
console.log(htmlWithAnchorsRemoved);

CodePudding user response：

Using regexp to parse html is never a good path to follow. Maybe the following will help you?

const html=`Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`;

function html2text(html){
 const o=document.createElement("div");
 o.innerHTML=html;
 return o.textContent;
}

console.log(html2text(html));
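
As a small illustration of why the regex route is fragile (a sketch, not part of the original answer): the pattern and the sample input below are assumptions chosen for the demo. An attribute value that contains ">" is enough to make a simple anchor-stripping regex cut the tag in the wrong place.

```javascript
// Hypothetical regex of the kind often used to strip anchors.
const anchorPattern = /<a[^>]*>(.*?)<\/a>/g;

// Valid HTML: the title attribute contains a ">" character.
const tricky = `<a title="a > b" href="#">link</a>`;

// [^>]* stops at the ">" inside the attribute value, so the rest of the
// opening tag is captured as if it were link text.
const stripped = tricky.replace(anchorPattern, "$1");
console.log(stripped);
```

A DOM-based approach such as html2text above does not have this problem, because the browser's HTML parser, not the pattern, decides where the tag ends.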

CodePudding user response：

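A pattern that matches every anchor element in a text would be something like this:

<a[^>]*>(.*?)</a>

with the global flag.

It matches each anchor element in your string, and the global flag makes it apply to every occurrence in the text rather than only the first. The capture group holds the link text. Note that replaceAll treats a string argument as literal text, so the pattern has to be passed as a regex literal, and the replacement should be "$1" rather than an empty string if you want to keep the link text:

let value = string.replaceAll(/<a[^>]*>(.*?)<\/a>/g, "$1");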
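
For reference, here is a runnable sketch using the link string from the question. It compares the question's anchored pattern with an unanchored one. The variable names are illustrative, and the unanchored regex still assumes simple, non-nested anchors without ">" inside attribute values.

```javascript
const link = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`;

// Pattern from the question: ^ and $ pin the match to the whole line, and the
// greedy .* runs up to the last ">" that still lets the rest of the pattern match.
const anchored = /^<a href.*>(.*?)<\/a>$/gm;
const anchoredResult = link.replace(anchored, "$1");

// Unanchored pattern: [^>]* stays inside one opening tag, so each anchor
// is matched and replaced by its own text.
const unanchored = /<a[^>]*>(.*?)<\/a>/g;
const unanchoredResult = link.replace(unanchored, "$1");

console.log(anchoredResult);   // test.com
console.log(unanchoredResult); // google.com test test.com
```

The first result reproduces the behaviour described in the question: the whole line is treated as a single match, so only the last link text survives.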

You can test the regex

Hope this helps. Let me know if you have any queries.

Regards