RegEx to remove all properties from style attribute except font-family in php

Time:11-25

I want to remove every property from the style attribute except font-family, using PHP.

I have tried this:

style=(.*)font-[^;]+;

Example HTML:

<div style='margin: 0px 14.3906px 0px 28.7969px; padding: 0px; width: 436.797px; float: left; font-family: "Open Sans", Arial, sans-serif;'><p style="margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding: 0px; text-align: justify;"><strong style="margin: 0px; padding: 0px;">Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div></div><div style='margin: 0px 28.7969px 0px 14.3906px; padding: 0px; width: 436.797px; float: right; font-family: "Open Sans", Arial, sans-serif;'></div>

Expected output:

<div style='font-family: "Open Sans", Arial, sans-serif;'><p><strong>Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div></div><div style='font-family: "Open Sans", Arial, sans-serif;'></div>

But it is not working as expected. Does anything need to change in this pattern?

CodePudding user response:

I think you have to do this with two substitutions. The first leaves only the font-family declaration in those HTML tags whose style contains one.

Use the following regex:

style=(['"])(?:[^\1>]*)(font-family:[^;]+;)(?:[^\1>]*)\1

With substitution:

style=$1$2$1


  1. style= matches 'style='
  2. (['"]) matches either a single or double quote and stores it in capture group 1
  3. (?:[^\1>]*) matches 0 or more characters that aren't a '>', so we do not scan past the end of the current HTML tag. The \1 is meant to exclude the opening quote as well, but inside a character class PCRE reads \1 as an octal escape rather than a backreference, so in practice only '>' is excluded
  4. (font-family:[^;]+;) matches the font-family declaration in capture group 2
  5. (?:[^\1>]*) matches the rest of the style declaration (again stopping before '>', so we don't scan past the current HTML tag)
  6. \1 matches the closing quote, which must be the same character as capture group 1 (a single or double quote)

The next regex will completely remove the style specification for those HTML tags that did not contain a font-family specification to begin with:

Use regex:

\sstyle=(['"])(?![^\1>]*font-family:)(?:[^\1>]*)\1

And substitute the empty string, ''


  1. \sstyle= matches whitespace followed by 'style='.
  2. (['"]) matches a single or double quote in capture group 1.
  3. (?![^\1>]*font-family:) is a negative lookahead asserting that what follows is not: 0 or more characters other than '>', followed by 'font-family:'. In other words, this style specification does not contain 'font-family:'.
  4. (?:[^\1>]*) matches 0 or more characters other than '>'. (The \1 inside the brackets is intended to exclude the opening quote too, but PCRE treats it there as an octal escape, not a backreference.)
  5. \1 matches the closing quote, i.e. the same character that was captured in group 1.

PHP Code

<?php

$html = <<<EOF
<div style='margin: 0px 14.3906px 0px 28.7969px; padding: 0px; width: 436.797px; float: left; font-family: "Open Sans", Arial, sans-serif;'><p style="margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding: 0px; text-align: justify;"><strong style="margin: 0px; padding: 0px;">Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div></div><div style='margin: 0px 28.7969px 0px 14.3906px; padding: 0px; width: 436.797px; float: right; font-family: "Open Sans", Arial, sans-serif;'></div>
EOF;
$html = preg_replace('/style=([\'"])(?:[^\1>]*)(font-family:[^;]+;)(?:[^\1>]*)\1/', 'style=$1$2$1', $html);
$html = preg_replace('/\sstyle=([\'"])(?![^\1>]*font-family:)(?:[^\1>]*)\1/', '', $html);
echo $html;

Prints:

<div style='font-family: "Open Sans", Arial, sans-serif;'><p><strong>Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div></div><div style='font-family: "Open Sans", Arial, sans-serif;'></div>

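As an alternative to the two substitutions, the same result can be sketched in a single pass with preg_replace_callback. This is only an illustration, not the method from the answer above: the function name keepFontFamily is hypothetical, and it assumes style values are quoted and that the font-family value contains no semicolon. Each quote type gets its own branch, so the match cannot run past the closing quote.

```php
<?php
// Hypothetical helper (not from the answer above): keep only font-family in
// every style attribute and drop the attribute when nothing is left.
function keepFontFamily(string $html): string
{
    // One branch per quote type, so the value ends at its own closing quote.
    $attr = '/\sstyle=(?:"([^"]*)"|\'([^\']*)\')/i';

    $result = preg_replace_callback($attr, function (array $m): string {
        $quote = substr($m[0], -1);   // closing quote of the matched attribute
        $value = $m[2] ?? $m[1];      // group 2 exists only for single quotes

        // Look for a font-family declaration at the start or after a ';'.
        if (preg_match('/(?:^|;)\s*(font-family\s*:[^;]+)/i', $value, $d)) {
            return ' style=' . $quote . trim($d[1]) . ';' . $quote;
        }
        return ''; // no font-family: remove the whole attribute
    }, $html);

    return $result ?? $html; // fall back to the input if PCRE reports an error
}

echo keepFontFamily('<p style="margin: 0px; font-family: Arial;">text</p>'), PHP_EOL;
// prints: <p style="font-family: Arial;">text</p>
```

Like any regex-based approach, this does not understand HTML, so text that merely looks like a style attribute (for example inside a script block) would be rewritten as well.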

CodePudding user response:

You might consider using DOMDocument to read the style attribute of every element.

If there is a style, then you can use a pattern to capture the font-family part in a capture group and use that group in the replacement.

.*?\b(font-[^;]+;?).*|.*

The pattern matches:

  • .*? Match as few chars as possible
  • \b( A word boundary, start capture group 1
    • font-[^;]+;? Match font- and then 1+ chars other than ; followed by an optional ;
  • ) Close group 1
  • .* Match the rest of the line
  • | Or
  • .* Match the whole line
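To see what the pattern does on its own, here is a small sketch that applies it to bare style values rather than whole tags. The sample strings are made up for illustration. It also shows one thing to watch for: because the pattern matches any property starting with font-, a style that has font-size before font-family keeps font-size instead. The $strict variant is an assumption of mine, not part of the answer.

```php
<?php
// Pattern from the answer above, applied to bare style values.
$pattern = '/.*?\b(font-[^;]+;?).*|.*/';

// Assumed stricter variant that targets font-family only.
$strict = '/.*?\b(font-family\s*:[^;]+;?).*|.*/';

$withFamily    = 'margin: 0px; float: left; font-family: "Open Sans", Arial, sans-serif;';
$withoutFamily = 'margin: 0px; padding: 0px;';
$sizeFirst     = 'font-size: 12px; font-family: Arial;';

// Keeps only the font-family declaration.
$keptFamily = preg_replace($pattern, '$1', $withFamily);

// No font- property at all: the second alternative matches and yields ''.
$keptNone = preg_replace($pattern, '$1', $withoutFamily);

// The first font- property wins, so font-size is kept here.
$keptLoose = preg_replace($pattern, '$1', $sizeFirst);

// The stricter variant keeps font-family instead.
$keptStrict = preg_replace($strict, '$1', $sizeFirst);

var_dump($keptFamily, $keptNone, $keptLoose, $keptStrict);
```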


For example

$data = <<<DATA
<div style='margin: 0px 14.3906px 0px 28.7969px; padding: 0px; width: 436.797px; float: left; font-family: "Open Sans", Arial, sans-serif;'><p style="margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding: 0px; text-align: justify;"><strong style="margin: 0px; padding: 0px;">Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div></div><div style='margin: 0px 28.7969px 0px 14.3906px; padding: 0px; width: 436.797px; float: right; font-family: "Open Sans", Arial, sans-serif;'></div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach($dom->getElementsByTagName('*') as $element ){
    if ($element->hasAttribute('style')) {
        $style = $element->getAttribute('style');
        $replacement = preg_replace("/.*?\b(font-[^;]+;?).*|.*/", "$1", $style);
        if (trim($replacement) !== "") {
            $element->setAttribute('style', $replacement);
        } else {
            $element->removeAttribute('style');
        }
    }
}

echo $dom->saveHTML();

Output

<div style='font-family: "Open Sans", Arial, sans-serif;'><p><strong>Lorem Ipsum</strong>&nbsp;is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.</p><div><br></div><div style='font-family: "Open Sans", Arial, sans-serif;'></div></div>
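Note that in the output above the second div ends up nested inside the first, which differs from the expected output in the question. This appears to be a side effect of LIBXML_HTML_NOIMPLIED when the fragment has more than one top-level element. A common workaround, sketched below on a shortened made-up fragment, is to wrap the input in a temporary root element and serialize only its children. The wrapper id is hypothetical.

```php
<?php
// Shortened sample with two top-level elements (made up for illustration).
$fragment = '<div style="margin: 0px; font-family: Arial;"></div>'
          . '<div style="padding: 0px;"></div>';

$dom = new DOMDocument();
// Temporary root element so the fragment has a single top-level node.
$dom->loadHTML(
    '<div id="wrapper">' . $fragment . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Visit only the elements that actually carry a style attribute.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@style]') as $element) {
    $style = $element->getAttribute('style');
    $kept  = preg_replace('/.*?\b(font-[^;]+;?).*|.*/', '$1', $style);
    if (trim($kept) !== '') {
        $element->setAttribute('style', $kept);
    } else {
        $element->removeAttribute('style');
    }
}

// Serialize the wrapper's children, leaving the wrapper itself out.
$out = '';
foreach ($dom->documentElement->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, PHP_EOL;
```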

CodePudding user response:

You can use the following regex pattern to match the font family and then strip the "font-family: " prefix afterwards.

font-family: [-A-Za-z," ]+

I haven't tested this in PHP, but the Python result is as follows:

"Open Sans", Arial, sans-serif
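For PHP, a minimal sketch of the same idea could look like this. It is an adaptation, not the tested Python code: it puts the character class in a capture group so the prefix does not need to be stripped afterwards. The sample style string is made up, and the character class will not match font names that contain digits or single quotes.

```php
<?php
$style = 'margin: 0px; float: left; font-family: "Open Sans", Arial, sans-serif;';

// Capture the value directly instead of stripping "font-family: " later.
if (preg_match('/font-family:\s*([-A-Za-z," ]+)/', $style, $m)) {
    $fontFamily = trim($m[1]);
    echo $fontFamily, PHP_EOL; // "Open Sans", Arial, sans-serif
}
```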