Text and emoji extraction from HTML

Time:08-20

I need to perform text and emoji extraction from HTML (I have no control over the HTML I get). I found it fairly simple to remove HTML tags using the following function; however, it strips out the emojis embedded within an <img> tag. The result should be plain text emoji characters.

I don't care much about spaces, but the cleaner it is, the better.

// this cleans the HTML quite well, but I need to extend it to keep the emojis
const stripTags = (html: string, ...args: string[]) => {
    return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
        return args.includes(tag) ? "<" + endMark + tag + ">" : ""
    }).replace(/<!--.*?-->/g, "")
}
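
A minimal sketch of one way the function above might be extended, assuming the emoji character is stored as plain text in the `alt` attribute of each `<img>` (the sample markup is cut off before the attribute value, so this is an assumption). The names `imgAltToText` and `stripTagsKeepEmoji` are hypothetical. The idea is to swap every `<img>` for its `alt` text first, and only then strip the remaining tags:

```typescript
// Hypothetical helper: replace each <img ...> with the text of its alt attribute.
// Assumes alt is quoted and does not itself contain a ">" character.
const imgAltToText = (html: string): string =>
    html.replace(/<img\b[^>]*>/gi, (imgTag: string): string => {
        // Match alt="..." or alt='...'; the leading \s avoids names like data-alt.
        const match = /\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(imgTag);
        return match ? (match[1] || match[2] || "") : "";
    });

// Hypothetical variant of stripTags: keeps emoji taken from <img alt>,
// keeps the tags listed in `keep`, and drops everything else.
const stripTagsKeepEmoji = (html: string, ...keep: string[]): string =>
    imgAltToText(html)
        // [\s\S] lets a comment span several lines
        .replace(/<!--[\s\S]*?-->/g, "")
        // removed tags become a space so neighbouring blocks do not run together
        .replace(/<(\/?)(\w+)[^>]*\/?>/g, (_m: string, endMark: string, tag: string) =>
            keep.includes(tag) ? "<" + endMark + tag + ">" : " ")
        // collapse the whitespace left behind by indentation and removed tags
        .replace(/\s+/g, " ")
        .trim();

// Example usage with made-up markup:
const sample =
    '<div><span>Good boy <img height="16" width="16" alt="🐶" src="x.png"></span>' +
    "<!-- note --><p>Herman</p></div>";
console.log(stripTagsKeepEmoji(sample)); // Good boy 🐶 Herman
```

Replacing removed tags with a space is a design choice: it keeps text from adjacent blocks apart, at the cost of splitting words that are interrupted by inline tags. The sketch does not decode HTML entities, so an `alt` value written as a numeric entity would come through unchanged. Regex-based stripping is also fragile on malformed markup; a real HTML parser would be more robust if one is available.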

Example of the HTML I receive:
<div>
   <div >
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div >
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div >
            <div dir="auto" style="text-align: start;">
               <span ><img height="16" width="16" alt="           