How to remove all HTML tags, including '&nbsp;', from a string?

I use CKEditor in one of my modules. It stores data with HTML tags, like this:

<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p>&nbsp;</p>\n\n<p>TItle&nbsp;</p>\n

I tried to convert it to plain text using this regex:

str.replace(/(<([^>]+)>)/ig, '');

However, I'm not getting the expected output.

I want this output:

'Lorem Ipsum & is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry &'s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.TItle.'

Note: this regex removes all the HTML tags but leaves "\n" and "&nbsp;" behind. How can I remove those from the string too?

CodePudding user response:

The text looks to be double-escaped: first turn all the &amp;s into &s, so that the HTML entities can be properly recognized. Then jQuery's .text() will give you the plain text version of the HTML markup.

const input = `<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p>&nbsp;</p>\n\n<p>TItle&nbsp;</p>\n`;
const inputWithProperEntities = input.replaceAll('&amp;', '&');
console.log($(inputWithProperEntities).text());

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

\n is not an HTML tag, but a representation of a newline character. If you want to remove all newlines too, then:

const input = `<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p>&nbsp;</p>\n\n<p>TItle&nbsp;</p>\n`;
const inputWithProperEntities = input.replaceAll('&amp;', '&');
console.log($(inputWithProperEntities).text().replaceAll('\n', ''));

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
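
If jQuery isn't an option (for example in Node, where there is no DOM), something like the following might work as a rough sketch of the same steps: undo the double escaping, strip the tags, decode the entities, then drop the newlines. The names htmlToPlainText, decodeEntities and NAMED_ENTITIES are made up for this illustration, the entity table only covers a handful of common names, and regex-based tag stripping is not a real HTML parser, so treat it as an approximation rather than a replacement for .text().

```javascript
// Hypothetical, deliberately tiny entity table (an assumption for this sketch).
// &nbsp; is mapped to a regular space here because the goal is to get rid of it;
// jQuery's .text() would return the non-breaking space character instead.
const NAMED_ENTITIES = { nbsp: ' ', amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode numeric entities (&#39; or &#x27;) and the few named ones listed above.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    const name = body.toLowerCase();
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
      ? NAMED_ENTITIES[name]
      : match; // unknown entity: keep it as-is
  });
}

// Hypothetical helper mirroring the steps from the answer above.
function htmlToPlainText(html) {
  const withProperEntities = html.replace(/&amp;/g, '&'); // undo the double escaping
  const withoutTags = withProperEntities.replace(/<[^>]+>/g, ''); // strip the tags
  return decodeEntities(withoutTags)
    .replace(/\r?\n/g, '') // remove newlines
    .replace(/\s+/g, ' ')  // collapse leftover whitespace, including non-breaking spaces
    .trim();
}

const sample = '<p>Lorem Ipsum&amp;nbsp;is simply the industry&amp;#39;s text.</p>\n\n<p>&nbsp;</p>\n\n<p>TItle&nbsp;</p>\n';
console.log(htmlToPlainText(sample));
// prints: Lorem Ipsum is simply the industry's text. TItle
```

The tags are stripped before the entities are decoded on purpose: otherwise an escaped "&lt;p&gt;" that was meant as literal text would turn into something that looks like a tag and get removed as well.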