I have a .txt file. Using the following code I read it:
while (!feof($handle)) {
    yield trim(utf8_encode(fgets($handle)));
}
Now I want to remove not only the HTML tags from the retrieved string but also the content inside them. I found many solutions that remove the tags, but none that remove both the tags and their content.
Sample string - Hey my name is <b>John</b>. I am a <i>coder</i>!
Required output string - Hey my name is . I am a !
How can I achieve this?
CodePudding user response:
One way to achieve this is by using DOMDocument and DOMXPath. My solution assumes that the provided HTML string has no container node or that the container node contents are not meant to be stripped (as this would result in a completely empty string).
$string = 'Hey my name is <b>John</b>. I am a <i>coder</i>!';
// create a DOMDocument (an XML/HTML parser)
$dom = new DOMDocument('1.0', 'UTF-8');
// load the HTML string without adding a <!DOCTYPE ...> and <html><body> tags
// and with error/warning reports turned off
// if loading fails, there's something seriously wrong with the HTML
if($dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING)) {
    // create a DOMXPath instance for the loaded document
    $xpath = new DOMXPath($dom);
    // remember the root node; DOMDocument automatically adds a <p> container if one is not present
    $rootNode = $dom->documentElement;
    // fetch all descendant nodes (children and grandchildren, etc.) of the root node;
    // the leading dot makes the query relative to $rootNode, so the root itself is not selected
    $childNodes = $xpath->query('.//*', $rootNode);
    // with each of these descendants...
    foreach($childNodes as $childNode) {
        // ...remove them from their parent node
        $childNode->parentNode->removeChild($childNode);
    }
    // echo the remaining text content
    echo $rootNode->nodeValue . "\n";
}
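Since the question reads the file line by line, the same steps can be packaged as a function and applied to each line. The following is only a sketch: the function name stripTagsWithContent is hypothetical, the array of lines stands in for whatever the generator yields, and the input is assumed to be plain ASCII (loadHTML() may misread other encodings unless told which one to use).

```php
<?php
// Hypothetical helper: removes every element together with its content.
function stripTagsWithContent(string $html): string
{
    // loadHTML() does not accept an empty string, so blank lines are handled up front
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;
    if (!$dom->loadHTML($html, $flags)) {
        return $html; // parsing failed; hand the line back untouched
    }

    $root = $dom->documentElement;
    if ($root === null) {
        return ''; // nothing element-like was parsed
    }

    $xpath = new DOMXPath($dom);
    // './/*' is relative to $root, so only its descendants are selected
    foreach ($xpath->query('.//*', $root) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $root->textContent;
}

// Sample lines standing in for what the generator in the question would yield
$lines = ['Hey my name is <b>John</b>. I am a <i>coder</i>!', ''];
foreach ($lines as $line) {
    echo stripTagsWithContent($line), "\n";
}
```

Returning the original line when parsing fails is a design choice made for this sketch; throwing an exception would be just as reasonable.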
If you do want to strip a potential container node, it's going to be a bit harder, because it's difficult to differentiate between an original container node and one that was automatically added by DOMDocument.
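One possible way around this ambiguity is to wrap the input in a container of your own before parsing, so the root node is always known and everything inside it, including an original container, can be stripped. This is a sketch rather than part of the solution above: the function name stripAllElements and the choice of <div> as the wrapper are assumptions made for illustration, and a stray </div> in the input would close the wrapper early.

```php
<?php
// Hypothetical helper: wraps the input in a known <div>, then removes every
// element inside that wrapper, including any container the input brought along.
function stripAllElements(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;
    if (!$dom->loadHTML('<div>' . $html . '</div>', $flags)) {
        return $html; // parsing failed; hand the string back untouched
    }

    // the wrapper we added ourselves is now the root element
    $wrapper = $dom->documentElement;
    if ($wrapper === null) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    // select only the descendants of the wrapper, never the wrapper itself
    foreach ($xpath->query('.//*', $wrapper) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $wrapper->textContent;
}

// With a container: the <p> and everything in it is removed
echo stripAllElements('<p>Hey my name is <b>John</b>.</p>'), "\n";
// Without a container: only the <b> element is removed
echo stripAllElements('Hey my name is <b>John</b>.'), "\n";
```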
Also, if a tag was unintentionally left unclosed, the result can be unexpected: everything from that tag up to the point where its element ends will be stripped, because DOMDocument automatically adds the missing closing tag.
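To see this effect, you can serialize the parsed document and look at where the parser decided the element ends. The snippet below is a small sketch with a made-up input string in which the <b> tag is never closed; the exact serialized markup may differ between libxml versions, so no output is quoted here.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;

// the <b> tag below is never closed
$dom->loadHTML('Hey <b>John. I am a coder!', $flags);

// serialize the parsed tree to inspect what the parser made of the input
$serialized = $dom->saveHTML();
echo $serialized, "\n";

// the text captured by the <b> element is what a tag-and-content strip would remove
$bold = $dom->getElementsByTagName('b')->item(0);
echo $bold->textContent, "\n";
```

If the second line printed contains more text than you meant to mark up, the strip will remove more than intended, so it may be worth validating the input first.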