Removing HTML attributes from an HTML string value using regex

I need to remove HTML attributes from an HTML string. I have some formatted-text input fields that allow users to copy and paste text while keeping its basic HTML. The issue is that text copied from a Word document comes with attributes that need to be removed. The regex I'm currently using works in a regex tester, but none of the attributes are being removed.

Code to remove attributes:

var stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

var regex = /[a-zA-Z]*=".*?"/;

var replacedstring = stringhtml.replace(regex, '');

document.write(replacedstring);

Any help is appreciated!

CodePudding user response:

There is a lot of literature on why parsing HTML with regex is risky; this famous Stack Overflow question is a good example.

As @Polymer has pointed out, your current regex will miss attributes with single quotes, but there are other gaps too: data attributes such as data-id="233" are only partially matched (the pattern removes id="233" and leaves data- behind), and attributes without a quoted value, like disabled, are not matched at all. There could be more!

With this approach you can end up constantly playing catch-up, changing your regex every time you encounter a new combination in your HTML.
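To make those gaps concrete, here is a small sketch that runs the pattern from the question against a few hypothetical sample strings (the `samples` array and variable names are made up for illustration). The `g` flag is added here because, without it, `String.prototype.replace` only replaces the first match.

```javascript
// The pattern from the question, with the g flag added so that every match
// is replaced (without g, only the first match would be removed).
const attrRegex = /[a-zA-Z]*=".*?"/g;

// Hypothetical inputs, one per case discussed above.
const samples = [
  '<p class="a">x</p>',     // double-quoted attribute
  "<p class='a'>x</p>",     // single-quoted attribute
  '<p data-id="233">x</p>', // data attribute with a hyphen in its name
  '<input disabled>',       // boolean attribute with no value
];

const results = samples.map((html) => html.replace(attrRegex, ''));

console.log(results[0]); // <p >x</p>        (removed, but a stray space remains)
console.log(results[1]); // <p class='a'>x</p>  (untouched)
console.log(results[2]); // <p data->x</p>   (only id="233" was removed)
console.log(results[3]); // <input disabled> (untouched)
```

Even the "working" case leaves a stray space inside the tag, and each new fix to the pattern tends to uncover another case like these.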

A safer approach might be to use DOMParser to parse your string as HTML and extract the contents from the resulting document:

let stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

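// Parse the string into a separate HTML document instead of pattern-matching it.
let parser = new DOMParser();
let parsedResult = parser.parseFromString(stringhtml, 'text/html');

// Create a fresh element with the same tag name, so it carries no attributes.
// firstElementChild skips any leading whitespace text node in the input.
let element = document.createElement(parsedResult.body.firstElementChild.tagName);

// Copy over only the text; nested tags inside the original element are not kept.
element.textContent = parsedResult.body.textContent;

console.log(element.outerHTML); // <div>some text</div>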
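
The snippet above keeps only the text of a single top-level element. If the pasted content has nested formatting (bold, lists, and so on) that should survive, one possible variation is to walk the parsed document and remove the attributes from every element instead. This is only a sketch under the assumption that all attributes should go; `stripAttributes` and `cleanHtml` are hypothetical helper names, not part of any library.

```javascript
// Remove every attribute from each element under `root`, keeping tags and text.
// (stripAttributes is a hypothetical helper name.)
function stripAttributes(root) {
  for (const el of root.querySelectorAll('*')) {
    // getAttributeNames() returns a plain array, so it is safe to remove
    // attributes while looping over it.
    for (const name of el.getAttributeNames()) {
      el.removeAttribute(name);
    }
  }
  return root;
}

// Parse an HTML string, strip all attributes, and return the cleaned markup.
// (cleanHtml is a hypothetical helper name.)
function cleanHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return stripAttributes(doc.body).innerHTML;
}

// Browser-only usage: DOMParser is not a Node.js global, hence the guard.
if (typeof DOMParser !== 'undefined') {
  const dirty = '<div class="a" style="color:red"><b id="x">some</b> text</div>';
  console.log(cleanHtml(dirty)); // <div><b>some</b> text</div>
}
```

If some attributes have to be kept (for example href on links), the inner loop could skip names on an allow-list rather than removing everything.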