I need to remove HTML attributes from an HTML string. I have some formatted text input fields that allow users to copy and paste text while keeping its basic HTML. The issue is that text copied from a Word document comes with attributes that need to be removed. The regex I'm currently using works in a regex tester, but none of the attributes are being removed.
Code to remove attributes:
var stringhtml = '<div class="Paragraph BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'
var regex = /[a-zA-Z]*=".*?"/;
var replacedstring = stringhtml.replace(regex, '');
document.write(replacedstring);
Any help is appreciated!
CodePudding user response:
There's quite a lot of literature out there on why parsing HTML with regex can be quite risky – this famous StackOverflow question is a good example.
As @Polymer has pointed out, your current regex will miss attributes with single quotes, but there are other possibilities too: data attributes (e.g. data-id="233") will only be partially matched, leaving a stray data- behind, and valueless attributes, like disabled, will be missed entirely. There could be more!
You can end up constantly playing catch-up with this approach, having to change your regex whenever you encounter a new combination in your HTML.
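To make those gaps concrete, here is a small sketch that runs the pattern from the question over a few sample strings. The g flag is added here as an assumption (without it, replace only touches the first match), and the helper name stripWithRegex and the sample strings are made up for illustration.

```javascript
// The pattern from the question, with the global flag added so that
// every match is replaced rather than only the first one.
const attrRegex = /[a-zA-Z]*=".*?"/g;

// Hypothetical helper: strip whatever the regex considers an attribute.
function stripWithRegex(html) {
  return html.replace(attrRegex, '');
}

// Double-quoted attribute: removed, but the separating space is left behind.
console.log(stripWithRegex('<p class="a">x</p>'));     // <p >x</p>

// Single-quoted attribute: not matched at all.
console.log(stripWithRegex("<p class='a'>x</p>"));     // <p class='a'>x</p>

// data-* attribute: only the part after the hyphen matches.
console.log(stripWithRegex('<p data-id="233">x</p>')); // <p data->x</p>

// Valueless attribute: not matched at all.
console.log(stripWithRegex('<input disabled>'));       // <input disabled>

// Ordinary text that happens to look like an attribute is removed too.
console.log(stripWithRegex('<p>set x="1"</p>'));       // <p>set </p>
```

Each case could be patched with a more elaborate pattern, but every patch tends to introduce a new edge case of its own.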
A safer approach might be to use DOMParser to parse your string as HTML, and extract the contents from it that way:
let stringhtml = '<div class="Paragraph BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'
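let parser = new DOMParser();
// Parse the string into a separate, detached HTML document
let parsedResult = parser.parseFromString(stringhtml, 'text/html');
// Create a fresh element with the same tag name, so it carries no attributes
// (firstElementChild skips any leading whitespace text node)
let element = document.createElement(parsedResult.body.firstElementChild.tagName);
// Copy over the text only; any nested markup is dropped
element.innerText = parsedResult.documentElement.textContent;
console.log(element);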
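The snippet above keeps only the text of the outer element. If the nested tags need to survive as well (the question mentions keeping the basic HTML), one possible variation is to walk every parsed element and remove its attributes in place. This is a minimal sketch, not a sanitizer: the function names removeAllAttributes and stripAttributes and the shortened sample string are assumptions for illustration, and the DOMParser part only runs in a browser.

```javascript
// Remove every attribute from every element underneath `root`.
function removeAllAttributes(root) {
  for (const el of root.querySelectorAll('*')) {
    // el.attributes is a live collection, so copy it before removing entries.
    for (const attr of Array.from(el.attributes)) {
      el.removeAttribute(attr.name);
    }
  }
}

// Parse an HTML string, strip the attributes, and serialize it back.
function stripAttributes(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  removeAllAttributes(doc.body);
  return doc.body.innerHTML;
}

// Usage (browser only, since DOMParser is not a Node.js built-in).
if (typeof DOMParser !== 'undefined') {
  const sample = '<div class="Paragraph" paraid="1" style="margin:0">some <b class="x">text</b></div>';
  console.log(stripAttributes(sample)); // <div>some <b>text</b></div>
}
```

Because the browser's own parser does the work, single quotes, data attributes and valueless attributes are all handled without special cases. To keep a few attributes (href on links, say), the inner loop could skip names on an allow-list.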