<StudyFieldsResponse>
<StudyFieldsList>
<StudyFields Rank="2">
<FieldValues Field="id">
<FieldValue>327635</FieldValue>
</FieldValues>
<FieldValues Field="Gender">
<FieldValue>Male</FieldValue>
<FieldValue>Female</FieldValue>
</FieldValues>
<FieldValues Field="code">
<FieldValue>55905</FieldValue>
</FieldValues>
</StudyFields>
<StudyFields Rank="3">
<FieldValues Field="id">
<FieldValue>555828</FieldValue>
</FieldValues>
<FieldValues Field="Gender">
<FieldValue>Male</FieldValue>
</FieldValues>
<FieldValues Field="code">
<FieldValue>55407-1139</FieldValue>
<FieldValue>77030</FieldValue>
<FieldValue>90901</FieldValue>
<FieldValue>23144</FieldValue>
</FieldValues>
</StudyFields>
</StudyFieldsList>
</StudyFieldsResponse>
I have the above .xml
file. I parsed it as follows in order to extract the id
, Gender
, and code
records.
library(XML)
dat <- xmlParse(file = "example.xml")
final_dat <- xmlToDataFrame(nodes = xmlChildren(xmlRoot(dat)[["StudyFieldsList"]]))
names(final_dat) <- c("id", "Gender", "code")
> final_dat
id Gender code
1 327635 MaleFemale 55905
2 555828 Male 55407-1139770309090123144
However, notice that for the first row, there are 2 Gender
s, male and female. Similarly, for the second one, there are more than 1 code
s. How can I expand my data.frame so that the data.frame contains a unique row for each unique id
, Gender
, and code
combination?
> final_dat_expanded
id Gender code
1 327635 Male 55905
2 327635 Female 55905
3 555828 Male 55407-1139
4 555828 Male 77030
5 555828 Male 90901
6 555828 Male 23144
CodePudding user response:
For complex, nested XML, consider XSLT, the special-purpose language designed to transform XML files. On Unix (Mac/Linux), R can run XSLT 1.0 scripts by making a system
call to open-source utility, xsltproc
. On Windows, R can run a parameterized PowerShell script.
Specifically, below XSLT targets the lowest level FieldValue under code and then retrieves ancestor nodes conditionally for id and gender, mapping through all nodes of gender with for-each
.
XSLT (save as .xsl, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/StudyFieldsResponse">
<xsl:copy>
<xsl:apply-templates select="descendant::FieldValues[@Field='code']/FieldValue"/>
</xsl:copy>
</xsl:template>
<xsl:template match="FieldValues[@Field='code']/FieldValue">
<xsl:variable name="curr_code"><xsl:value-of select="text()"/></xsl:variable>
<xsl:for-each select="ancestor::StudyFields/FieldValues[@Field='Gender']/FieldValue">
<xsl:copy>
<id><xsl:value-of select="ancestor::StudyFields/FieldValues[@Field='id']/FieldValue"/></id>
<gender><xsl:copy-of select="text()"/></gender>
<code><xsl:copy-of select="$curr_code"/></code>
</xsl:copy>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
R
library(XML)
# TRANSFORM INPUT TO FLATTER OUTPUT AND SAVE TO DISK
system("xsltproc style.xsl input.xml -o output.xml")
doc <- xmlParse("output.xml")
# BIND TO DATA FRAME
field_values_df <- xmlToDataFrame(doc)
field_values_df
# id gender code
# 1 327635 Male 55905
# 2 327635 Female 55905
# 3 555828 Male 55407-1139
# 4 555828 Male 77030
# 5 555828 Male 90901
# 6 555828 Male 23144
Should you use Windows, consider PowerShell script
PowerShell (save as .ps1)
param ($xml, $xsl, $output)
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load($xsl)
$xslt.Transform($xml, $output)
R
library(XML)
# TRANSFORM INPUT TO FLATTER OUTPUT AND SAVE TO DISK
system("PowerShell -ExecutionPolicy bypass -File transform.ps1 input.xml style.xsl output.xml")
doc <- xmlParse("output.xml")
# BIND TO DATA FRAME
field_values_df <- xmlToDataFrame(doc)