I have used CKEDITOR in one of my modules. It stores data with HTML tags like below:
<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p> </p>\n\n<p>TItle </p>\n
I tried to convert it to plain text using this regex:
str.replace(/(<([^>]+)>)/ig, '');
However, I'm not getting the expected output.
I want this output:
'Lorem Ipsum is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.TItle.'
Note: This regex removes all HTML tags but leaves "\n" and "&nbsp;" behind. How can I remove "\n" and "&nbsp;" from the string too?
CodePudding user response:
The text looks to be double-escaped. First turn all the &amp;s into &s so that the HTML entities can be properly recognized. Then .text() will give you the plain-text version of the HTML markup.
const input = `<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p> </p>\n\n<p>TItle </p>\n`;
// Undo the double escaping so that &amp;nbsp; becomes &nbsp;, &amp;#39; becomes &#39;, and so on.
const inputWithProperEntities = input.replaceAll('&amp;', '&');
// Let jQuery parse the markup and return only its text content.
console.log($(inputWithProperEntities).text());
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
\n is not an HTML tag, but the representation of a newline character. If you want to remove all newlines too, then:
const input = `<p>Lorem Ipsum&amp;nbsp;is simply dummy text of the printing and typesetting industry.Lorem Ipsum has been the industry&amp;#39;s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p>\n\n<p> </p>\n\n<p>TItle </p>\n`;
// Undo the double escaping so that the HTML entities are recognized.
const inputWithProperEntities = input.replaceAll('&amp;', '&');
// Extract the text, then drop every newline character.
console.log($(inputWithProperEntities).text().replaceAll('\n', ''));
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
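If jQuery or a DOM is not available (for example when the stored string is processed in Node), the same two steps can be roughly approximated with plain string operations. This is only a sketch under some assumptions: htmlToPlainText is a hypothetical helper name, the entity table covers just a handful of named entities, &nbsp; is mapped to an ordinary space, and stripping tags with a regex is acceptable only for simple, trusted markup like the CKEditor output above. A real HTML parser is the safer choice for anything more complex.

```javascript
// Hypothetical helper, not part of jQuery or CKEditor.
// Illustrative entity table only; HTML defines many more named entities.
const NAMED_ENTITIES = new Map([
  ['nbsp', ' '], // simplification: treat a non-breaking space as a plain space
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function htmlToPlainText(html) {
  return html
    // 1. Undo the double escaping (&amp;nbsp; becomes &nbsp;).
    .replace(/&amp;/g, '&')
    // 2. Strip tags; the + quantifier makes [^>] match the whole tag body.
    .replace(/<[^>]+>/g, '')
    // 3. Decode decimal numeric entities such as &#39;.
    .replace(/&#(\d+);/g, (match, code) => String.fromCodePoint(Number(code)))
    // 4. Decode the named entities listed above; leave unknown ones untouched.
    .replace(/&([a-z]+);/g, (match, name) => NAMED_ENTITIES.get(name) ?? match)
    // 5. Collapse newlines and runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(htmlToPlainText('<p>industry&amp;#39;s standard</p>\n\n<p>TItle </p>\n'));
// → industry's standard TItle
```

Unlike replaceAll('\n', ''), step 5 replaces each newline run with a single space, so the text of adjacent paragraphs does not get glued together.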