I'm getting HTML content from an API.
A sample message could look like this:
Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>
I need my message to look like the text below: plain text, with only the link text kept
Lorem ipsum dolor sit amet example.com
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor google.com
I've tried to use a regex with capture groups; the JS code is below:
const r = /^<a href.*>(.*?)<\/a>$/gm
let link = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`
let result
while ((result = r.exec(link)) !== null) {
  const match = result[1];
  link = link.replace(r, match);
}
console.log(link)
I also tried simpler code, like below:
const r = /^<a href.*>(.*?)<\/a>$/gm
let link = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`
link = link.replaceAll(r, "$1")
console.log(link)
Unfortunately, in both cases console.log prints "test.com" instead of the whole message.
Are there any better solutions?
CodePudding user response:
You do not need a regular expression for this. You can use the DOM to remove the links and any other HTML tags.
const htmlString = `Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`
const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, "text/html");
const text = doc.body.textContent;
console.log(text);
If you just want to remove the links and leave the other HTML tags in place, that is also possible.
const htmlString = `Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque <b>porta</b> ligula <em>et justo</em> condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`
const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, "text/html");
const anchors = doc.body.querySelectorAll("a");
anchors.forEach(node => node.replaceWith(...node.childNodes));
const htmlWithAnchorsRemoved = doc.body.innerHTML;
console.log(htmlWithAnchorsRemoved);
CodePudding user response:
Using a regexp to parse HTML is never a good path to follow. Maybe the following will help you?
const html=`Lorem ipsum dolor sit amet <a href="https://example.com">example.com</a>
Pellentesque porta ligula et justo condimentum, nec tincidunt libero tempor.
Pellentesque nunc justo, tincidunt sit amet suscipit sit amet, auctor <a href="https://google.com">google.com</a>`;
function html2text(html) {
  const o = document.createElement("div");
  o.innerHTML = html;
  return o.textContent;
}
console.log(html2text(html));
CodePudding user response:
The pattern for removing all anchor tags from a text while keeping the link text would be something like this:
<a\b[^>]*>(.*?)</a>
with the global flag.
It matches every anchor tag anywhere in your string, and the capture group holds the link text. Note that replaceAll treats a string argument as literal text, so pass a regex literal with the g flag, and use $1 as the replacement so the link text is kept:
let value = string.replaceAll(/<a\b[^>]*>(.*?)<\/a>/g, "$1");
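To see why the pattern from the question prints only "test.com", here is a minimal sketch that runs it next to a per-tag pattern on the same sample string. The ^ and $ anchors force a match to span a whole line, and the greedy .* runs up to the last opening tag, so only the last link text lands in the capture group. The helper name stripAnchors and the variable names are made up for illustration, and the per-tag pattern assumes no attribute value contains a ">" character; for arbitrary HTML the DOM-based answers are the safer route.

```javascript
// Sample string taken from the question.
const sample = `<a href="https://google.com" target="_blank">google.com</a> test <a href="test.com">test.com</a>`;

// Pattern from the question: anchored to line start/end, greedy ".*".
const anchored = /^<a href.*>(.*?)<\/a>$/gm;
console.log(sample.replace(anchored, "$1")); // test.com

// Hypothetical helper: unwrap each <a>...</a> on its own, keeping its text.
function stripAnchors(html) {
  // <a\b        opening anchor tag (\b avoids matching <abbr>, <article>, ...)
  // [^>]*       its attributes, stopping at the first ">"
  // ([\s\S]*?)  the link text, matched lazily, line breaks included
  return html.replace(/<a\b[^>]*>([\s\S]*?)<\/a>/gi, "$1");
}

console.log(stripAnchors(sample)); // google.com test test.com
```

On the multi-line message from the question, the anchored pattern matches nothing at all, because no line starts with an anchor tag, while the per-tag pattern unwraps each link where it stands.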
Hope this helps. Let me know if you have any queries.
Regards