Regex to remove spaces between invalid HTML tags - i.e. "< / b >" should be "</b>"

I have some HTML that is mangled by stray spaces inside the tags, and I want to make it valid again - for example:

< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >

This should be converted to valid HTML, so the expected result is:

<div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

Any > or < preceded/followed by spaces in the text should be left unchanged - for example, 1 > 0 should remain, rather than being squashed to 1>0

I realize this will probably take a couple of regular expressions, which is fine.

Here is what I have so far:

<\s?\/\s* which partially fixes </ b>< / div > to </b></div >, but I am struggling with the rest.

For example, I could go with a heavy-handed approach, but this will also break code within the text parts of the tags, rather than the tag names themselves
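
The partial fix described above can be checked with a short sketch. The pattern is the one from the question; the g flag and the variable names are assumptions added here so that every closing tag in the string is handled.

```javascript
// Pattern from the question, with the g flag added so all matches are replaced.
const closingTagStart = /<\s?\/\s*/g;

// Only the start of each closing tag is repaired; the space before ">" remains.
const partiallyFixed = "</ b>< / div >".replace(closingTagStart, "</");
console.log(partiallyFixed); // </b></div >
```

Opening tags such as < b > and the trailing space before > are untouched, which is the part still missing.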

CodePudding user response：

There's no reasonable way to save a document as corrupt as what you've posted, but assuming you replace the > and similar characters in the text with their relevant entities, e.g. &gt;, you can massage the document enough for a proper library like DOMDocument to accept it and handle the rest.

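$input = <<<_E_
< div class='test' >1 &gt; 0 is < b >true</ b> and apples &gt;&gt;&gt; bananas< / div >
_E_;

// Strip whitespace after "<", then after "</" (the patterns are applied in order).
$input = preg_replace([ '#<\s+#', '#</\s+#' ], [ '<', '</' ], $input);

// The parser normalizes what is left, such as the space before ">".
$d = new DOMDocument();
$d->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

var_dump($d->saveHTML());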

Output:

string(80) "<div class="test">1 &gt; 0 is <b>true</b> and apples &gt;&gt;&gt; bananas</div>
"

CodePudding user response：

This regex works too:

It captures the valid parts of an HTML tag in four groups and replaces the whole match with just those groups, which drops the spaces between them.

/(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g

(<) - capture the opening angle bracket (group 1)
\s* - match any spaces
(\/?) - capture an optional forward slash (group 2)
\s* - match any spaces after the forward slash
([^<>]*\S) - capture the content inside the tag without the trailing spaces (group 3)
\s* - match spaces after the content and before the closing angle bracket
(>) - capture the closing angle bracket (group 4)

const reg = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g
const str = "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div  >"
const newStr = str.replace(reg, "$1$2$3$4");
console.log(newStr);
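
To make the behaviour easier to check, here is a minimal sketch that wraps the same pattern in a function and prints the result. The names collapseTagSpaces, tagPattern and sample are hypothetical and not part of the answer. The second call shows a caveat worth knowing: a lone < in the text that is later followed by a > gets treated as a tag.

```javascript
// Same pattern as in the answer; collapseTagSpaces is a hypothetical helper name.
const tagPattern = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;

function collapseTagSpaces(html) {
  // Rebuild each match from its four captured groups, dropping the spaces between them.
  return html.replace(tagPattern, "$1$2$3$4");
}

const sample = "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div  >";
console.log(collapseTagSpaces(sample));
// <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

// Caveat: text between a stray "<" and a later ">" is matched as if it were a tag.
console.log(collapseTagSpaces("a < b and c > d"));
// a <b and c> d
```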

CodePudding user response：

You can use a couple of .replace()s with a RegEx and a custom replace callback:

let s = `< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >`;

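// For each tag: drop all spaces, re-insert one after the tag name, then remove a space left before ">".
s = s.replace(/<.*?>/g, m =>
  m.replaceAll(' ', '')
    .replace(m.match(/[a-zA-Z]+/)[0], tagName => tagName + ' ')
    .replace(' >', '>')
);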

console.log(s);

Here's a breakdown of the RegExs:

1. s.replace(/<.*?>/g, /* arrow function */)

This runs the arrow function as the custom replacer for each stretch of text from a < to the next >, so the replacement only affects the inside of the tags. The arrow function takes one parameter, m, which is the matched text, and returns the text to replace it with.

2. m.replaceAll(' ', '')

This removes all spaces in the match. It also removes the space between the tag name and the attributes, which is why step 3 is needed.

3. .replace(m.match(/[a-zA-Z]+/)[0], tagName => tagName + ' ')

This takes the result of step 2 and adds a space after the tag name. m.match(/[a-zA-Z]+/)[0] is the tag name because m still contains the original text from before step 2.

4. .replace(' >', '>')

This handles the last edge case: if the tag had no attributes or was a closing tag, step 3 added an unnecessary space before the >, which is removed here.
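
One limitation of the steps above is that removing every space also squashes the spaces between multiple attributes, and /[a-zA-Z]+/ stops at the first digit in names like h1. As a rough alternative sketch (not the answer's method), a single replace with capture groups can tidy only the spaces around the tag name and before the >. The function name tidyTags and the pattern are assumptions made for illustration, and the pattern still treats any "< word ... >" stretch in the text as a tag.

```javascript
// Hypothetical helper: tidy spacing around the tag name, keep attribute spacing as-is.
function tidyTags(html) {
  // Groups: optional "/", tag name (letters then letters/digits), remaining attributes.
  const pattern = /<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9]*)([^<>]*?)\s*>/g;
  return html.replace(pattern, (match, slash, name, rest) => `<${slash}${name}${rest}>`);
}

console.log(tidyTags("< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >"));
// <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

console.log(tidyTags("< a href='x' title='y' >go< / a >"));
// <a href='x' title='y'>go</a>

console.log(tidyTags("< h1 >Title< / h1 >"));
// <h1>Title</h1>
```

Because the tag name must start with a letter, a comparison such as 1 < 2 is left alone, but something like a < b > c would still be rewritten.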