R: How to expand a data.frame based on nested results in xml file-CodePudding

<StudyFieldsResponse>
    <StudyFieldsList>
      <StudyFields Rank="2">
      <FieldValues Field="id">
        <FieldValue>327635</FieldValue>
      </FieldValues>
      <FieldValues Field="Gender">
        <FieldValue>Male</FieldValue>
        <FieldValue>Female</FieldValue>
      </FieldValues>
      <FieldValues Field="code">
        <FieldValue>55905</FieldValue>
      </FieldValues>
     </StudyFields>
    <StudyFields Rank="3">
      <FieldValues Field="id">
        <FieldValue>555828</FieldValue>
      </FieldValues>
      <FieldValues Field="Gender">
        <FieldValue>Male</FieldValue>
      </FieldValues>
      <FieldValues Field="code">
        <FieldValue>55407-1139</FieldValue>
        <FieldValue>77030</FieldValue>
        <FieldValue>90901</FieldValue>
        <FieldValue>23144</FieldValue>
      </FieldValues>
    </StudyFields>
  </StudyFieldsList>
</StudyFieldsResponse>

I have the above .xml file. I parsed it as follows in order to extract the id, Gender, and code records.

library(XML)
dat <- xmlParse(file = "example.xml")
final_dat <- xmlToDataFrame(nodes = xmlChildren(xmlRoot(dat)[["StudyFieldsList"]]))
names(final_dat) <- c("id", "Gender", "code")
> final_dat
      id     Gender                      code
1 327635 MaleFemale                     55905
2 555828       Male 55407-1139770309090123144

However, notice that for the first row, there are 2 Genders, male and female. Similarly, for the second one, there are more than 1 codes. How can I expand my data.frame so that the data.frame contains a unique row for each unique id, Gender, and code combination?

> final_dat_expanded
          id     Gender            code
    1 327635       Male           55905
    2 327635     Female           55905
    3 555828       Male      55407-1139
    4 555828       Male           77030
    5 555828       Male           90901
    6 555828       Male           23144

CodePudding user response：

For complex, nested XML, consider XSLT, the special-purpose language designed to transform XML files. On Unix (Mac/Linux), R can run XSLT 1.0 scripts by making a system call to open-source utility, xsltproc. On Windows, R can run a parameterized PowerShell script.

Specifically, below XSLT targets the lowest level FieldValue under code and then retrieves ancestor nodes conditionally for id and gender, mapping through all nodes of gender with for-each.

XSLT (save as .xsl, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/StudyFieldsResponse">
     <xsl:copy>
       <xsl:apply-templates select="descendant::FieldValues[@Field='code']/FieldValue"/>
     </xsl:copy>
    </xsl:template>
    
    <xsl:template match="FieldValues[@Field='code']/FieldValue">
     <xsl:variable name="curr_code"><xsl:value-of select="text()"/></xsl:variable>
     <xsl:for-each select="ancestor::StudyFields/FieldValues[@Field='Gender']/FieldValue">
       <xsl:copy>
           <id><xsl:value-of select="ancestor::StudyFields/FieldValues[@Field='id']/FieldValue"/></id>
           <gender><xsl:copy-of select="text()"/></gender>
           <code><xsl:copy-of select="$curr_code"/></code>
       </xsl:copy>
     </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

Online Demo

library(XML)

# TRANSFORM INPUT TO FLATTER OUTPUT AND SAVE TO DISK
system("xsltproc style.xsl input.xml -o output.xml")
doc <- xmlParse("output.xml")

# BIND TO DATA FRAME
field_values_df <- xmlToDataFrame(doc)
field_values_df
#       id gender       code
# 1 327635   Male      55905
# 2 327635 Female      55905
# 3 555828   Male 55407-1139
# 4 555828   Male      77030
# 5 555828   Male      90901
# 6 555828   Male      23144

Should you use Windows, consider PowerShell script

PowerShell (save as .ps1)

param ($xml, $xsl, $output)

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load($xsl)
$xslt.Transform($xml, $output)

library(XML)

# TRANSFORM INPUT TO FLATTER OUTPUT AND SAVE TO DISK
system("PowerShell -ExecutionPolicy bypass -File transform.ps1 input.xml style.xsl output.xml")
doc <- xmlParse("output.xml")

# BIND TO DATA FRAME
field_values_df <- xmlToDataFrame(doc)