I have this regex:
preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>',$string);
It removes all inline attributes from the HTML tags (style, border, ...).
What is the easiest way to adjust the regex so that it does not remove href from the "<a..." tag?
CodePudding user response:
There is no "easy way" to do this with regex.
A tool built for this kind of transformation is XSLT. With no templates defined, its built-in rules drop all element markup and copy only the text nodes to the output.
By defining templates you can match specific nodes and add logic. More specific template matches have priority (a template matching "a" wins over one matching "*").
Input:
<div class="ab">Test</div>
<p style="padding: 1px">
<a href="#123">Link</a>
</p>
Stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output
    method="html"
    encoding="utf-8"
    standalone="no"
    indent="yes"
    omit-xml-declaration="yes"/>

  <!-- match any element - pass through without attributes -->
  <xsl:template match="*">
    <!-- add it to the output -->
    <xsl:element name="{local-name()}">
      <!-- apply templates to children -->
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- match "a" elements - pass through with href attribute -->
  <xsl:template match="a">
    <!-- add a new "a" element to the output -->
    <xsl:element name="a">
      <!-- copy the "href" attribute -->
      <xsl:copy-of select="@href"/>
      <!-- apply templates to children -->
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- match "html" or "body" elements - ignore them -->
  <xsl:template match="html|body">
    <!-- apply templates to children -->
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
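You can define a template like this for each tag that needs special handling.
Next, use PHP to apply the stylesheet to the HTML. XSLTProcessor is provided by PHP's xsl extension, which must be enabled.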
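// $html holds the HTML string to clean; $xsltFile is the path to the stylesheet above
$document = new DOMDocument();
$document->loadHTML($html);

$stylesheet = new DOMDocument();
$stylesheet->load($xsltFile);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

// apply the stylesheet and get the result as a string
$result = $processor->transformToXml($document);
echo $result;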
Output:
<div>Test</div>
<p>
<a href="#123">Link</a>
</p>
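To illustrate the point about per-tag rules, here is a minimal self-contained sketch that inlines the stylesheet as a string and adds one more template. The img rule (keeping only src and alt), the stripAttributes() helper name, and the sample markup are assumptions made for illustration; they are not part of the answer above. The libxml flags are only there to silence parser warnings on real-world HTML.

```php
<?php
// Stylesheet from the answer, inlined, plus one illustrative rule for <img>.
// A nowdoc (<<<'XSLT') is used so that {local-name()} is not interpolated by PHP.
$xslt = <<<'XSLT'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="utf-8" omit-xml-declaration="yes"/>

  <!-- any element: re-create it without its attributes -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- a: keep only href -->
  <xsl:template match="a">
    <xsl:element name="a">
      <xsl:copy-of select="@href"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- img: keep only src and alt (hypothetical extra rule) -->
  <xsl:template match="img">
    <xsl:element name="img">
      <xsl:copy-of select="@src|@alt"/>
    </xsl:element>
  </xsl:template>

  <!-- html/body wrappers added by loadHTML(): drop them, keep their children -->
  <xsl:template match="html|body">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
XSLT;

// Hypothetical helper: applies the given stylesheet string to an HTML fragment.
function stripAttributes(string $html, string $xslt): string
{
    $document = new DOMDocument();
    // Suppress libxml warnings about markup the HTML parser does not like.
    $document->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $stylesheet = new DOMDocument();
    $stylesheet->loadXML($xslt);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($stylesheet);

    return (string) $processor->transformToXml($document);
}

$html = '<p style="padding: 1px"><a href="#123" class="x">Link</a> '
      . '<img src="a.png" alt="A" width="10"></p>';

$result = stripAttributes($html, $xslt);
echo $result;
```

With this sample input, the style, class and width attributes should be gone from the result, while href, src and alt remain. Any other attribute you want to keep needs its own xsl:copy-of in the matching template.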