How to remove HTML tags as well as HTML content within a string in PHP?

I have a .txt file. Using the following code I read it:

while (!feof($handle)) {
    yield trim(utf8_encode(fgets($handle)));
}

Now I want to remove not only the HTML tags from the retrieved string but also the content inside them. I found many solutions that remove the tags, but none that remove both the tags and their content.

Sample string - Hey my name is <b>John</b>. I am a <i>coder</i>!

Required output string - Hey my name is . I am a !

How can I achieve this?

CodePudding user response:

One way to achieve this is by using DOMDocument and DOMXPath. My solution assumes that the provided HTML string has no container node or that the container node contents are not meant to be stripped (as this would result in a completely empty string).

$string = 'Hey my name is <b>John</b>. I am a <i>coder</i>!';

// create a DOMDocument (an XML/HTML parser)
$dom = new DOMDocument('1.0', 'UTF-8');
// load the HTML string without adding a <!DOCTYPE ...> and <html><body> tags
// and with error/warning reports turned off
// if loading fails, there's something seriously wrong with the HTML
if($dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING)) {
  // create a DOMXPath instance for the loaded document
  $xpath = new DOMXPath($dom);

  // remember the root node; DOMDocument automatically adds a <p> container if one is not present
  $rootNode = $dom->documentElement;
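  // fetch all descendant elements (children, grandchildren, etc.) of the root node;
  // the leading dot makes the query relative to $rootNode, whereas '//*' would also match the root itself
  $childNodes = $xpath->query('.//*', $rootNode);
  // with each of these descendants...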
  foreach($childNodes as $childNode) {
    // ...remove them from their parent node
    $childNode->parentNode->removeChild($childNode);
  }

  // echo the remaining text content
  echo $rootNode->nodeValue . "\n";
}
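
The same steps can be wrapped in a reusable function, which may be convenient when processing a file line by line as in the question. This is only a minimal sketch: the name `stripElementsWithContent` is hypothetical, and it relies on the behavior described above, where the parser puts input that starts with bare text into a single root element. It removes only the direct child elements of the root, since removing an element also removes everything nested inside it.

```php
<?php
// Hypothetical helper; the name is not part of PHP or of the answer above.
function stripElementsWithContent(string $html): string
{
    // loadHTML() rejects an empty string, so return early
    if ($html === '') {
        return '';
    }

    $dom = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;

    // Assumption of this sketch: return the input untouched when parsing fails
    if (!$dom->loadHTML($html, $flags) || $dom->documentElement === null) {
        return $html;
    }

    $root = $dom->documentElement;
    $xpath = new DOMXPath($dom);

    // './*' is evaluated relative to $root and matches its direct child elements only
    foreach ($xpath->query('./*', $root) as $element) {
        $root->removeChild($element);
    }

    // Only the text that sat directly inside the root is left
    return $root->textContent;
}

echo stripElementsWithContent('Hey my name is <b>John</b>. I am a <i>coder</i>!'), "\n";
```

Inside the reading loop from the question, each line could then be passed through this function before it is yielded.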

If you do want to strip a potential container node, it's going to be a bit harder, because it's difficult to differentiate between an original container node and a container node that was automatically added by DOMDocument.
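
One possible workaround, sketched here under stated assumptions, is to wrap the input in a container of your own before parsing. The root node is then always known, and a container that is part of the input is treated like any other element. The function name `stripAllElements` and the choice of `<div>` as the wrapper are assumptions made for illustration.

```php
<?php
// Hypothetical helper: wraps the input so that the root element is always our own <div>.
function stripAllElements(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;

    // Assumption of this sketch: return the input untouched when parsing fails
    if (!$dom->loadHTML('<div>' . $html . '</div>', $flags) || $dom->documentElement === null) {
        return $html;
    }

    $wrapper = $dom->documentElement;
    $xpath = new DOMXPath($dom);

    // Remove every element directly inside the wrapper, including an original container
    foreach ($xpath->query('./*', $wrapper) as $element) {
        $wrapper->removeChild($element);
    }

    return $wrapper->textContent;
}

var_dump(stripAllElements('Hey my name is <b>John</b>. I am a <i>coder</i>!'));
var_dump(stripAllElements('<p>Hey my name is <b>John</b>.</p>'));
```

With this variant, a string that is entirely enclosed in a container comes back empty, which matches the caveat mentioned earlier. A stray `</div>` in the input would close the wrapper early, so the sketch is not robust against that case.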

Also, an unintentionally unclosed tag can lead to unexpected results: DOMDocument automatically adds the missing closing tag, so everything from the opening tag up to that inferred closing tag is stripped.
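
To see where the parser places such an inferred closing tag, it can help to print the markup as DOMDocument understood it. The following is a small sketch; `parsedMarkup` is a hypothetical name, and the `<div>` wrapper is an assumption added only to give the fragment a known root.

```php
<?php
// Hypothetical helper that returns the markup as DOMDocument parsed it.
function parsedMarkup(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;

    // The <div> wrapper is an assumption of this sketch, not part of the original answer
    $dom->loadHTML('<div>' . $html . '</div>', $flags);

    // Serialize only the root element; saveHTML() may return false, hence the cast
    return (string) $dom->saveHTML($dom->documentElement);
}

// The <b> tag is never closed in this input
echo parsedMarkup('Hey <b>John. I am a coder!'), "\n";
```

The printed markup should show the closing `</b>` placed at the very end of the text, which is why stripping the `<b>` element would leave only `Hey ` behind.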