Convert lists in Word to Docbook transformation

Time:12-26

I need to convert a large number of Word documents (Word 2003 XML; 30 documents of roughly 80 pages each) to DocBook 5.1. I have written a stylesheet for this and it works so far, but I am stuck on the following problem:

The documents contain many lists. The Word XML marks each list item (<w:listPr>), but as far as I can see it does not indicate where a list begins and ends. There are only the individual list items.

In XSLT I can already turn each list item into a <listitem>, but I don't know how to wrap a run of list items in the enclosing list element (<itemizedlist>).

One way might be to capture the lists with for-each-group or something similar and copy the text content of the nodes into my target document. But the list items contain other markup, such as <w:instrText> (DocBook: <indexterm>), which must not be lost.

How can I handle this?

Word 2003 XML source (excerpt)

<w:p>
     <w:pPr>
        <w:pStyle w:val="2Standard"/>
            <w:listPr>
                 <w:ilvl w:val="0"/>
                 <w:ilfo w:val="14"/>
                 <wx:t wx:val="·"/>
                 <wx:font wx:val="Symbol"/>
            </w:listPr>
      </w:pPr>
     <w:r>
          <w:t>die Prognose der Wirtschaft</w:t>
      </w:r>
       <w:r>
          <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
          <w:instrText> XE "Wirtschaft"</w:instrText>
      </w:r>
      <w:r>
          <w:fldChar w:fldCharType="end"/>
      </w:r>
</w:p>
<w:p>
     <w:pPr>
        <w:pStyle w:val="2Standard"/>
            <w:listPr>
                 <w:ilvl w:val="0"/>
                 <w:ilfo w:val="14"/>
                 <wx:t wx:val="·"/>
                 <wx:font wx:val="Symbol"/>
            </w:listPr>
      </w:pPr>
      <w:r>
          <w:t>die Beratung der Politik.</w:t>
      </w:r>
</w:p>
 

Desired Output


<itemizedlist>
     <listitem>
         <para>die Prognose der Wirtschaft 
            <indexterm><primary>Wirtschaft</primary></indexterm>
         </para>
      </listitem>
      <listitem>
         <para>die Beratung der Politik.</para>
      </listitem>
</itemizedlist>

First Stylesheet approach

<xsl:template match="w:p">
        <xsl:choose>
            <xsl:when test="w:pPr/w:listPr/w:ilvl/@w:val = '0'">
                <listitem>
                    <para>
                       <xsl:apply-templates select="w:r"/>
                    </para>
                </listitem>
            </xsl:when>
            <xsl:otherwise>
                <para>
                    <xsl:apply-templates/>
                </para>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template match="w:r">
        <xsl:choose>
            <xsl:when test="w:instrText">
                <indexterm>
                    <primary>
                        <xsl:apply-templates select="*/text()"/>
                    </primary>
                </indexterm>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="w:t"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

Answer:

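I think it should be possible with for-each-group and group-adjacent: adjacent paragraphs that have a listPr form one group, which is wrapped in <itemizedlist>. The grouped paragraphs are processed with apply-templates rather than copied as text, so child markup such as the index term is not lost. An approach along these lines: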

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://example.com/"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="xml" indent="yes" suppress-indentation="indexterm"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="root">
    <xsl:for-each-group select="p" group-adjacent="boolean(self::p[pPr/listPr])">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <itemizedlist>
              <xsl:apply-templates select="current-group()" mode="list"/>
            </itemizedlist>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>
  
  <xsl:template match="p" mode="list">
    <listitem>
      <para>
        <xsl:apply-templates mode="#current"/>
      </para>
    </listitem>
  </xsl:template>
  
  <xsl:template match="instrText" mode="list">
    <indexterm>
      <primary>
        <xsl:apply-templates mode="#current"/>
      </primary>
    </indexterm>
  </xsl:template>
  
</xsl:stylesheet>

This transforms

<w:root xmlns:w="http://example.com/" xmlns:wx="http://example.com/wx">
  <w:p>
     <w:pPr>
        <w:pStyle w:val="2Standard"/>
            <w:listPr>
                 <w:ilvl w:val="0"/>
                 <w:ilfo w:val="14"/>
                 <wx:t wx:val="·"/>
                 <wx:font wx:val="Symbol"/>
            </w:listPr>
      </w:pPr>
     <w:r>
          <w:t>die Prognose der Wirtschaft</w:t>
      </w:r>
       <w:r>
          <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
          <w:instrText> XE "Wirtschaft"</w:instrText>
      </w:r>
      <w:r>
          <w:fldChar w:fldCharType="end"/>
      </w:r>
</w:p>
<w:p>
     <w:pPr>
        <w:pStyle w:val="2Standard"/>
            <w:listPr>
                 <w:ilvl w:val="0"/>
                 <w:ilfo w:val="14"/>
                 <wx:t wx:val="·"/>
                 <wx:font wx:val="Symbol"/>
            </w:listPr>
      </w:pPr>
      <w:r>
          <w:t>die Beratung der Politik.</w:t>
      </w:r>
</w:p> 
</w:root>

into

<itemizedlist>
   <listitem>
      <para>die Prognose der Wirtschaft<indexterm><primary> XE "Wirtschaft"</primary></indexterm>
      </para>
   </listitem>
   <listitem>
      <para>die Beratung der Politik.</para>
   </listitem>
</itemizedlist>
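
Note that the result above still carries the raw field code ( XE "Wirtschaft") inside <primary>, whereas the desired output has only Wirtschaft. A minimal sketch of one way to extract the quoted entry follows. It assumes the field text always has the form XE "entry", optionally with a subentry after a colon, which this sketch simply discards. The template is meant to sit next to the existing instrText template in the stylesheet above and uses the same xpath-default-namespace.

```xml
<!-- Sketch only: assumes the field instruction looks like  XE "entry"  or  XE "entry:subentry". -->
<!-- The predicate gives this template a higher default priority than match="instrText". -->
<xsl:template match="instrText[matches(., '^\s*XE\s+&quot;')]" mode="list">
  <indexterm>
    <primary>
      <!-- Keep only the text between the opening quote and the first colon or closing quote -->
      <xsl:value-of select="replace(., '^\s*XE\s+&quot;([^&quot;:]*).*$', '$1', 's')"/>
    </primary>
  </indexterm>
</xsl:template>
```

Field instructions that do not start with XE still fall through to the general instrText template, so you may want to decide separately how to handle those.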

Please consider providing namespace-well-formed samples/snippets next time.
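
Because the question's snippets declare no namespaces, the stylesheet above uses the placeholder namespace http://example.com/. Below is a hedged sketch of how the same grouping might look against real files. It assumes the documents use the usual Word 2003 namespaces (wordml for w, auxHint for wx) and that the paragraphs sit directly inside w:body, wx:sect or wx:sub-section. Grouping on the w:ilfo value instead of a boolean is an extra assumption on my part: it starts a new <itemizedlist> when two lists with different w:ilfo values follow each other directly. The <article> wrapper is purely illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="xml" indent="yes" suppress-indentation="indexterm"/>
  <xsl:strip-space elements="*"/>

  <!-- Illustrative entry point: process only the body, not styles, fonts or properties -->
  <xsl:template match="/">
    <article>
      <xsl:apply-templates select="w:wordDocument/w:body"/>
    </article>
  </xsl:template>

  <!-- Assumed containers of paragraphs; the key is the w:ilfo value, or '' for non-list content -->
  <xsl:template match="w:body | wx:sect | wx:sub-section">
    <xsl:for-each-group select="*"
        group-adjacent="string(self::w:p/w:pPr/w:listPr/w:ilfo/@w:val)">
      <xsl:choose>
        <xsl:when test="current-grouping-key() != ''">
          <itemizedlist>
            <xsl:apply-templates select="current-group()" mode="list"/>
          </itemizedlist>
        </xsl:when>
        <xsl:otherwise>
          <xsl:apply-templates select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>

  <!-- A list paragraph becomes one list item -->
  <xsl:template match="w:p" mode="list">
    <listitem>
      <para>
        <xsl:apply-templates select="w:r" mode="#current"/>
      </para>
    </listitem>
  </xsl:template>

  <!-- Any other paragraph becomes a plain para -->
  <xsl:template match="w:p">
    <para>
      <xsl:apply-templates select="w:r"/>
    </para>
  </xsl:template>

  <!-- As in the stylesheet above, every field instruction is treated as an index term -->
  <xsl:template match="w:instrText" mode="#all">
    <indexterm>
      <primary>
        <xsl:apply-templates mode="#current"/>
      </primary>
    </indexterm>
  </xsl:template>

</xsl:stylesheet>
```

Nested lists (w:ilvl greater than 0), numbered lists and tables are not handled here and would need their own templates.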