How to transform HTML into XML-TEI with PHP?

Time:12-04

I need to turn some HTML strings into an XML file written with a specific set of TEI (Text Encoding Initiative) tags. That file should then be provided to lodel, a web-based academic publishing system, in order to get published online.

A bit more context:

  • I'm using PHP 7.2.
  • The HTML strings can be malformed and complex (with tables, images, blockquotes, footnotes, ...).
  • The XML-TEI I need to output is a mix of simple nodes (their creation with SimpleXMLElement is straightforward), and others that must be generated from the HTML.
  • The transformation from HTML to XML-TEI implies some tweaks, such as replacing
<strong>foo</strong>

with

<hi rend="bold">foo</hi>

Or

<h1>Foo</h1>
some other nodes...

with

<div type="div1">
    <head subtype="level1">Foo</head>
    some other nodes...
</div>

What I can't do:

  • Include libtidy or its php class (that would at least help cleaning the HTML)
  • Change the technical situation, even though I know that XML-TEI is supposed to be used to generate HTML and not the opposite.

What I tried:

  • Load the HTML string into a DOMDocument, loop through the nodes and build separate XML (with SimpleXMLElement, DOM, or even XMLWriter)
  • Load the HTML string as XML (!) into a DOMDocument, load some XSLT, and output XML

I managed to generate some XML with the above methods, and it works for the standard fields, but whenever it comes to the HTML segment I lose either the tree structure or the content. I have the feeling that XSLT would be the best bet, but I can't figure out how to use it.

Edit with code samples:

Example with SimpleXMLElement:

The export class:

class XMLToLodelService {

    $raw_html = '<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head><body><h1>Main <em>Title</em></h1><h4>test</h4><p>&nbsp;</p><p></p><p> </p><p>Paragraph</p><p id="foo">Another paragraph</p><h1>And a <strong>second</strong> title</h1><h2>Some subtitle</h2><p>Foobar</p></body></html>';

    $string = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://lodel.org/ns/tei/tei.openedition.1.6.2/document.xsd"></TEI>
XML;
    $xml = new SimpleXMLElement($string);
    //...
    
    $text = $xml[0]->addChild('text', '');
    $this->parseBody($text, $raw_html);

    public function parseBody(&$core, $text){
        $dom = new DOMDocument;
        $dom->formatOutput = true;
        $dom->encoding = 'UTF-8';
        $dom->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));

        $body = $dom->getElementsByTagName('body')[0];
        $core->addChild('body', '');
        $core = $core->body;

        // let's loop through nodes with DOM functions
        // and add xml step by step in $core
        $body->normalize();
        $this->parseNodes($core, $body->childNodes);
    }

    public function parseNodes(&$core, $elements){
        foreach($elements as $node){
            if($this->isHeading($node)){
                $nextNode = $this->translateHeading($core, $node);
            }elseif($node->nodeName != '#text'){
                $nextNode = $core->addChild($node->nodeName, $node->textContent);
            }else{
                continue;
            }
            if($node->hasChildNodes()){
                $this->parseNodes($nextNode, $node->childNodes);
            }
        }
    }

    public function isHeading($node){
        return in_array($node->nodeName, ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);
    }

    public function translateHeading(&$core, $node){
        $level = str_split($node->nodeName)[1];
        $head = new ExSimpleXMLElement('<head subtype="level' . $level . '"></head>');
        $div = $core->addChild('div', $head);
        $div->addAttribute('subtype', 'div' . $level);
        return $div;
    }

}

The result:

<TEI xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://lodel.org/ns/tei/tei.openedition.1.6.2/document.xsd">
    <teiHeader>
        // well-generated code...
    </teiHeader>
    <text>
        <body>
            <div subtype="div1">
                <em>Title</em>
            </div>
            <div subtype="div4"/>
            <p> </p>
            <p/>
            <p> </p>
            <p>Paragraph</p>
            <p>Another paragraph</p>
            <div subtype="div1">
                <strong>second</strong>
            </div>
            <div subtype="div2"/>
            <p>Foobar</p>
        </body>
    </text>
</TEI>

Example with XSLT: Here I just tried to add an id to every h1 item, just to practice XSLT.

The export class:

class XMLToLodelService {

    $raw_html = '<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head><body><h1>Main <em>Title</em></h1><h4>test</h4><p>&nbsp;</p><p></p><p> </p><p>Paragraph</p><p id="foo">Another paragraph</p><h1>And a <strong>second</strong> title</h1><h2>Some subtitle</h2><p>Foobar</p></body></html>';

    $html = new DOMDocument();
    $html->loadXML($raw_html);
    $html->normalizeDocument();

    $xsl = new DOMDocument();
    $xsl->load('xslt.xsl');

    $xsltProcessor = new XSLTProcessor;
    $xsltProcessor->importStylesheet($xsl);

    echo $xsltProcessor->transformToXml($html);

}

The xslt file:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="//h1">
    <root>
      <xsl:apply-templates select="//h1"/>
    </root>
  </xsl:template>

  <xsl:template match="//h1">
    <xsl:element id="someid{position()}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

The result:

<TEI xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://lodel.org/ns/tei/tei.openedition.1.6.2/document.xsd">
    <teiHeader>
        // well-generated code...
    </teiHeader>
    <text>
        <body/> //shouldn't be empty
    </text>
</TEI>

I may have overlooked or misunderstood something. Any help will be greatly appreciated.

Edit after ThW's answer:

The accepted answer works like a charm for most of my use cases. I ran into problems for very specific markup. I want to share one in particular here, in case it could help someone.

In order to transform:

<h1>Title</h1>
//some siblings tags...

Into:

<div type="div1">
    <head subtype="level1">Title</head>
    //some siblings tags...
</div>

I had to use a particular approach in my xslt. The accepted answer did not work when nested heading tags were involved, or tags of different levels (i.e. h1 then h2 and so on). I used this xslt markup for this specific case:

  <xsl:template match="/">
      <xsl:apply-templates select="//h1"/>
  </xsl:template>

  <xsl:template match="*[starts-with(local-name(), 'h')]">
    <xsl:variable name="lvl" select="number(substring-after(local-name(), 'h'))"/>
    <div type="div{$lvl}">
      <head subtype="level{$lvl}">
        <xsl:apply-templates select="text()|./*" mode="richtext"/>
      </head>
      <xsl:apply-templates select="//following-sibling::*[not(starts-with(local-name(), 'h'))
                           and preceding-sibling::*[starts-with(local-name(), 'h')][1] = current()]"/>
      <xsl:apply-templates select="//following-sibling::*[local-name() = concat('h', $lvl + 1)
                           and preceding-sibling::*[local-name() = concat('h', $lvl)][1] = current()]"/>
      <xsl:apply-templates select="//following-sibling::*[local-name() = concat('h', $lvl + 2)
                           and preceding-sibling::*[local-name() = concat('h', $lvl)][1] = current()]"/>
      <xsl:apply-templates select="//following-sibling::*[local-name() = concat('h', $lvl + 3)
                           and preceding-sibling::*[local-name() = concat('h', $lvl)][1] = current()]"/>
      <xsl:apply-templates select="//following-sibling::*[local-name() = concat('h', $lvl + 4)
                           and preceding-sibling::*[local-name() = concat('h', $lvl)][1] = current()]"/>
      <xsl:apply-templates select="//following-sibling::*[local-name() = concat('h', $lvl + 5)
                           and preceding-sibling::*[local-name() = concat('h', $lvl)][1] = current()]"/>
    </div>
  </xsl:template>

It's a tweak from this topic: XHTML to Structured XML with XSLT 1.0
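For comparison, the same grouping can be sketched in plain PHP DOM code instead of XSLT. This is a different technique from the stylesheet above, not a translation of it: it walks the children of the HTML body once and keeps a stack of open sections. The function name `nestHeadings` is hypothetical, the TEI namespace is left out for brevity, and inline markup inside headings is flattened to text.

```php
<?php
/**
 * Nest a flat list of body children into <div>/<head> sections.
 * Hypothetical helper: a stack of open sections tracks which
 * element currently receives content.
 */
function nestHeadings(DOMElement $htmlBody, DOMDocument $out): DOMElement
{
    $root = $out->createElement('body');
    // Each entry is [heading level, receiving element]; level 0 is the root.
    $stack = [[0, $root]];

    foreach ($htmlBody->childNodes as $node) {
        if ($node instanceof DOMElement
            && preg_match('/^h([1-6])$/', $node->nodeName, $m)) {
            $level = (int) $m[1];
            // Close every open section at the same or a deeper level.
            while (end($stack)[0] >= $level) {
                array_pop($stack);
            }
            $div = $out->createElement('div');
            $div->setAttribute('type', 'div' . $level);
            $head = $out->createElement('head');
            $head->setAttribute('subtype', 'level' . $level);
            // textContent drops inline tags such as <em> (simplification).
            $head->appendChild($out->createTextNode($node->textContent));
            $div->appendChild($head);
            end($stack)[1]->appendChild($div);
            $stack[] = [$level, $div];
        } else {
            // Non-heading nodes are copied into the innermost open section.
            end($stack)[1]->appendChild($out->importNode($node, true));
        }
    }
    return $root;
}

// Usage with a small made-up document.
$html = new DOMDocument();
$html->loadHTML('<html><body><h1>One</h1><p>a</p><h2>Sub</h2><p>b</p><h1>Two</h1><p>c</p></body></html>');

$out = new DOMDocument();
$out->appendChild(nestHeadings($html->getElementsByTagName('body')->item(0), $out));
$xml = $out->saveXML($out->documentElement);
```

In this sketch the h2 section ends up inside the first h1 section, and the second h1 opens a new top-level section. The copied nodes would still need the HTML to TEI element mapping applied to them.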

Thanks for your time!

Answer:

I think you have the right idea with XSLT. Specifically, load the HTML as HTML into the DOM; there is no need to load it as XML. Then use named templates for the base structure and a separate mode for the richtext fragments.

However it will be some work to map all the HTML elements to TEI elements.
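To illustrate the point about loading the markup as HTML, here is a minimal sketch that feeds a shortened version of the question's sample string to both parsers. The variable names are made up for the example. A failed `loadXML()` may be one reason the XSLT attempt in the question produced no body content, although that is a guess.

```php
<?php
// Shortened sample from the question: unclosed <meta> and an HTML-only entity.
$html = '<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head>'
      . '<body><h1>Main <em>Title</em></h1><p>&nbsp;</p><p>Paragraph</p></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The XML parser rejects the string because it is not well-formed XML.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($html);
libxml_clear_errors();

// The HTML parser accepts the same string and builds a usable tree.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($html);
libxml_clear_errors();

$paragraphs = $asHtml->getElementsByTagName('p')->length;
$heading = $asHtml->getElementsByTagName('h1')->item(0)->textContent;

var_dump($xmlOk, $htmlOk, $paragraphs, $heading);
```

`libxml_use_internal_errors(true)` is an alternative to the `@` operator for keeping parser warnings out of the output.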

$template = <<<'XSLT'
<xsl:stylesheet 
  version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- match the document element (the html element) -->
  <xsl:template match="/*">
    <!-- add container and header elements -->
    <TEI
      xsi:schemaLocation="http://www.tei-c.org/ns/1.0 http://lodel.org/ns/tei/tei.openedition.1.6.2/document.xsd">
     <xsl:call-template name="tei-header"/>
     <text>
       <!-- apply richtext fragment templates using a separate mode --> 
       <xsl:apply-templates select="body" mode="richtext" />
     </text>
    </TEI>
  </xsl:template>
  
  <!-- named header template -->
  <xsl:template name="tei-header">
    <teiHeader>...</teiHeader>
  </xsl:template>

  <!-- match h1, add id attribute and remove any descendant except text content -->
  <xsl:template match="h1" mode="richtext">
    <head id="someid{position()}">
      <xsl:value-of select="."/>
    </head>
  </xsl:template>

  <!-- match p, add to output and apply templates to descendants -->
  <xsl:template match="p" mode="richtext">
    <p>
      <!-- apply templates to descendants -->
      <xsl:apply-templates mode="richtext"/>
    </p>
  </xsl:template>
  
</xsl:stylesheet>
XSLT;

$htmlDocument = new DOMDocument();
@$htmlDocument->loadHTML(getHTML());

$xslDocument = new DOMDocument();
$xslDocument->loadXML($template);

$processor = new XSLTProcessor();
$processor->importStylesheet($xslDocument);

echo $processor->transformToXML($htmlDocument);

function getHTML() {
  return <<<'HTML'
    <html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head><body><h1>Main <em>Title</em></h1><h4>test</h4><p>&nbsp;</p><p></p><p> </p><p>Paragraph</p><p id="foo">Another paragraph</p><h1>And a <strong>second</strong> title</h1><h2>Some subtitle</h2><p>Foobar</p></body></html>
HTML;
}
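
Mapping further inline elements follows the same pattern as the `p` template. The sketch below is a reduced, standalone stylesheet, not the full one above. The `strong` to `<hi rend="bold">` rule comes from the question; the `rend="italic"` value and the catch-all fallback template are assumptions and should be checked against the TEI schema that Lodel expects.

```php
<?php
$template = <<<'XSLT'
<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>

  <!-- start at the body element and stay in richtext mode -->
  <xsl:template match="/">
    <body>
      <xsl:apply-templates select="/html/body/*" mode="richtext"/>
    </body>
  </xsl:template>

  <xsl:template match="p" mode="richtext">
    <p><xsl:apply-templates mode="richtext"/></p>
  </xsl:template>

  <!-- mapping taken from the question -->
  <xsl:template match="strong|b" mode="richtext">
    <hi rend="bold"><xsl:apply-templates mode="richtext"/></hi>
  </xsl:template>

  <!-- assumed mapping, not confirmed by the question -->
  <xsl:template match="em|i" mode="richtext">
    <hi rend="italic"><xsl:apply-templates mode="richtext"/></hi>
  </xsl:template>

  <!-- fallback: drop unmapped wrappers but keep their content -->
  <xsl:template match="*" mode="richtext">
    <xsl:apply-templates mode="richtext"/>
  </xsl:template>
</xsl:stylesheet>
XSLT;

$htmlDocument = new DOMDocument();
$htmlDocument->loadHTML(
    '<html><body><p>A <strong>bold</strong> and <em>italic</em> <span>word</span></p></body></html>'
);

$xslDocument = new DOMDocument();
$xslDocument->loadXML($template);

$processor = new XSLTProcessor();
$processor->importStylesheet($xslDocument);

$result = $processor->transformToXML($htmlDocument);
echo $result;
```

Text nodes need no template of their own here, because the built-in XSLT rules copy text in every mode. The fallback keeps the content of unmapped elements such as `span` instead of silently dropping it.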