How can I group XML elements, return only the text values, and wrap in new XML elements?

Time:06-11

I am working with TEI documents with the following structure:

<body>
  <pb n="1"/>
  <head>text text <lb/>text text<lb/>text text <ref target="n1">
      <hi rend="super">1</hi>
    </ref>
  </head>
  <byline>text</byline>
  <note type="bio" place="bottom">text text...</note>
  <div>
    <p>text, <title>text</title> text <title>text</title> text text text<title>Saved</title> text text text</p>
    <q>text text text text<ref target="n2">
        <hi rend="super">2</hi>
      </ref></q>
    <p>text <title>text</title> and <title>text, </title>
    </p>
  </div>
  <pb n="2"/>
  <div>
    <p>
      <title>text</title> text text text</p>
    <p> text text text text <hi rend="italic">text,</hi> "text." text text text <hi rend="italic">text,</hi> text text text<ref target="n3">
        <hi rend="super">3</hi>
      </ref> text text text <hi rend="italic">text</hi> text text text <hi rend="italic">text</hi> text text text text<ref target="n4">
        <hi rend="super">4</hi>
      </ref> text text text text</p>
    <p>text text text text <hi rend="italic">text text text text<ref target="n5">
          <hi rend="super">5</hi>
        </ref></hi> text text text text text </p>
  </div>
  <pb n="3"/> text text text...
</body>

I need to wrap only the text between successive pb elements in a page element. There is a very similar post on Stack Overflow, "XSLT wrap nodes between specific element", whose accepted answer I adapted. The problem is that it copies all descendant nodes to the output. I want only the text returned, with other elements such as <head>, <byline>, or <p> removed. Only the text values need to be copied.

Here's my XSLT:

<xsl:template match="tei:text/tei:body">
  <text xmlns="http://digital.library.ptsem.edu/ptsl" type="ocr" source="tei">
    <xsl:variable name="parent" select="."/>
    <xsl:for-each-group select="descendant::node()" group-starting-with="tei:pb[@n]">
      <page number="{@n}" xmlns="http://digital.library.ptsem.edu/ptsl">
        <xsl:apply-templates select="$parent/node()[descendant-or-self::node() intersect current-group()]" mode="subtree"/>     
      </page>        
    </xsl:for-each-group>
  </text> 
</xsl:template>

<xsl:template match="tei:pb[@n]" mode="subtree"/>

<xsl:template match="node()" mode="subtree">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates select="node()[descendant-or-self::node() intersect current-group()]" mode="subtree"/>
  </xsl:copy>   
</xsl:template>

Returned result is:

<?xml version="1.0" encoding="UTF-8"?>
<text>
  <page number="1">
    <head>text text <lb/>text text<lb/>text text <ref target="n1">
        <hi rend="super">1</hi>
      </ref>
    </head>
    <byline>text</byline>
    <note type="bio" place="bottom">text text...</note>
    <div>
      <p>text, <title>text</title> text <title>text</title> text text text<title>Saved</title> text text text</p>
      <q>text text text text<ref target="n2">
          <hi rend="super">2</hi>
        </ref></q>
      <p>text <title>text</title> and <title>text, </title>
      </p>
    </div>
  </page>
  <page number="2"> 
    <div>
      <p>
        <title>text</title> text text text</p>
      <p> text text text text <hi rend="italic">text,</hi> "text." text text text <hi rend="italic">text,</hi> text text text<ref target="n3">
          <hi rend="super">3</hi>
        </ref> text text text <hi rend="italic">text</hi> text text text <hi rend="italic">text</hi> text text text text<ref target="n4">
          <hi rend="super">4</hi>
        </ref> text text text text</p>
      <p>text text text text <hi rend="italic">text text text text<ref target="n5">
            <hi rend="super">5</hi>
          </ref></hi> text text text text text </p>
    </div>
  </page>
  <page number="3"> text text text... </page>
</text>

Desired result is:

<text>
  <page number="1">text text text text text text 1 text text text...
    text, text text text text text text Saved text text text
      text text text text 2 text text and text, 
  </page>
  <page number="2">text text text text
      text text text text text, "text." text text text text, text text text 3
         text text text text text text text text text text text text 4
         text text text text text text text text text text text text 5
           text text text text text 
  </page>
  <page number="3"> text text text... </page>
</text>

CodePudding user response:

It seems using

  <page number="{@n}" xmlns="http://digital.library.ptsem.edu/ptsl">
    <xsl:copy-of select="current-group()[self::text()]"/>     
  </page>  

instead of

  <page number="{@n}" xmlns="http://digital.library.ptsem.edu/ptsl">
    <xsl:apply-templates select="$parent/node()[descendant-or-self::node() intersect current-group()]" mode="subtree"/>     
  </page>  

should be enough to output only the text nodes in each group.
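
For context, here is a minimal sketch of how the whole stylesheet might look with that change applied. It is an illustration under stated assumptions, not a tested drop-in: the tei prefix is assumed to be bound to the standard TEI namespace (the question does not show its declaration), the xsl:if guard and the empty tei:teiHeader template are additions that are not part of the original code, and an XSLT 2.0 processor is assumed because of xsl:for-each-group.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- Assumption: the header should not contribute any text to the output. -->
  <xsl:template match="tei:teiHeader"/>

  <xsl:template match="tei:text/tei:body">
    <text xmlns="http://digital.library.ptsem.edu/ptsl" type="ocr" source="tei">
      <!-- Every descendant node, in document order, starts a new group at each pb. -->
      <xsl:for-each-group select="descendant::node()"
                          group-starting-with="tei:pb[@n]">
        <!-- Assumption: anything before the first pb (typically only
             indentation whitespace) is skipped, so no page with an
             empty number attribute is produced. -->
        <xsl:if test="self::tei:pb">
          <page number="{@n}">
            <!-- Keep only the text nodes of the group; all elements are dropped. -->
            <xsl:copy-of select="current-group()[self::text()]"/>
          </page>
        </xsl:if>
      </xsl:for-each-group>
    </text>
  </xsl:template>

</xsl:stylesheet>
```

Because the group is built from descendant::node(), text nodes at any depth are already members of current-group(), so no recursive "subtree" templates are needed. Whitespace-only text nodes between elements are copied as well, which is why the result keeps the line breaks and indentation of the source.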
