I need to extract text and emojis from HTML (I have no control over the HTML I receive). Stripping the HTML tags with the following function is fairly simple; however, it also strips out the emojis, which are embedded in <img> tags. The result should contain the emojis as plain-text characters.
I don't care much about whitespace, but the cleaner the output, the better.
// this cleans the HTML quite well, but I need to extend it to keep the emojis
const stripTags = (html: string, ...args: string[]) => {
  return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
    // keep whitelisted tags (without their attributes), drop everything else
    return args.includes(tag) ? "<" + endMark + tag + ">" : ""
  }).replace(/<!--.*?-->/g, "")
}
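One direction I am considering, as a rough sketch only: replace each <img> with the text of its alt attribute before stripping the remaining tags. This assumes the emoji character is carried in alt, which I have not confirmed for every input. The names imgAltToText and extractText are hypothetical helpers, not part of any library, and the regexes do not handle an alt value that contains ">" or HTML entities.

```typescript
// Hypothetical helper: swap every <img ...> for the value of its alt attribute.
// Assumption: the emoji character lives in alt (double- or single-quoted).
const imgAltToText = (html: string): string =>
  html.replace(/<img\b[^>]*>/gi, (imgTag: string) => {
    const m = /\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(imgTag);
    // An <img> without an alt attribute contributes nothing.
    return m ? (m[1] ?? m[2] ?? "") : "";
  });

// Same idea as the function above: drop every tag except the whitelisted ones,
// then drop comments ([\s\S] so comments spanning several lines also match).
const stripTags = (html: string, ...args: string[]): string =>
  html
    .replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark: string, tag: string) =>
      args.indexOf(tag) !== -1 ? "<" + endMark + tag + ">" : ""
    )
    .replace(/<!--[\s\S]*?-->/g, "");

// Hypothetical wrapper: alt text first, then tags, then collapse whitespace.
const extractText = (html: string): string =>
  stripTags(imgAltToText(html)).replace(/\s+/g, " ").trim();
```

The order matters here: the <img> replacement has to run first, because stripTags would otherwise delete the tag together with its alt value.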
<div>
<div >
<span dir="auto">
<div>
<div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
</div>
<div >
<div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
</div>
<div >
<div dir="auto" style="text-align: start;">
<span ><img height="16" width="16" alt="