XML to data frame error because of duplicate subscripts for columns

I am trying to read an XML file into a data frame. My file is big, and probably because of that I get an error when I try. The error is the following:

Error in [<-.data.frame(*tmp*, i, names(nodes[[i]]), value = c(graphics = "", : duplicate subscripts for columns

The code I used is just this:

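library(XML)

# parse the KGML file into an XML document object
breast_cancer <- xmlParse("breastcancerkegg.xml")

# this is the call that raises the error
xmldataframe <- xmlToDataFrame("breastcancerkegg.xml")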

If you need a sample of the XML file, see the following (from genome.jp/pathway/hsa05224 H00031):

<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "https://www.kegg.jp/kegg/xml/KGML_v0.7.2_.dtd">
<!-- Creation date: Apr 11, 2018 17:05:28 +0900 (GMT+9) -->
<pathway name="path:hsa05224" org="hsa" number="05224"
         title="Breast cancer"
         image="https://www.kegg.jp/kegg/pathway/hsa/hsa05224.png"
         link="https://www.kegg.jp/kegg-bin/show_pathway?hsa05224">
    <entry id="2" name="path:hsa05224" type="map"
        link="https://www.kegg.jp/dbget-bin/www_bget?hsa05224">
        <graphics name="TITLE:Breast cancer" fgcolor="#000000" bgcolor="#FFFFFF"
             type="roundrectangle" x="123" y="58" width="165" height="25"/>
    </entry>
    <entry id="6" name="hsa:2099 hsa:2100" type="gene"
        link="https://www.kegg.jp/dbget-bin/www_bget?hsa:2099+hsa:2100">
        <graphics name="ESR1, ER, ESR, ESRA, ESTRR, Era, NR3A1..." fgcolor="#000000" bgcolor="#BFFFBF"
             type="rectangle" x="710" y="187" width="46" height="17"/>
    </entry>
    <entry id="7" name="hsa:2099 hsa:2100" type="gene"
        link="https://www.kegg.jp/dbget-bin/www_bget?hsa:2099+hsa:2100">
        <graphics name="ESR1, ER, ESR, ESRA, ESTRR, Era, NR3A1..." fgcolor="#000000" bgcolor="#BFFFBF"
             type="rectangle" x="1041" y="187" width="46" height="17"/>
    </entry>
    <entry id="9" name="cpd:C00951" type="compound"
        link="https://www.kegg.jp/dbget-bin/www_bget?C00951">
        <graphics name="C00951" fgcolor="#000000" bgcolor="#FFFFFF"
             type="circle" x="237" y="190" width="8" height="8"/>
    </entry>
    <entry id="11" name="hsa:1956" type="gene"
        link="https://www.kegg.jp/dbget-bin/www_bget?hsa:1956">
        <graphics name="EGFR, ERBB, ERBB1, HER1, NISBD2, PIG61, mENA" fgcolor="#000000" bgcolor="#BFFFBF"
             type="rectangle" x="325" y="593" width="46" height="17"/>
    </entry>

Any idea please?

CodePudding user response：

This is an answer to the comment by OP: how to access specific nodes in an XML document.

I prefer the xml2 library over the XML library. The xml2 vignettes are recommended reading / reference material.

Understanding the data format

The first step is to open the file in an XML browser or in a plain text editor, to look at the structure of the document. We can see that there is one pathway node, containing multiple entry and relation nodes. Each entry node contains a graphics node, and each relation node contains a subtype node.
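The same structure check can also be done from R. The following is a minimal sketch that counts the element names directly below the root. The inline document is a tiny made-up stand-in for the KGML file (in particular, the relation element is illustrative and not copied from hsa05224.xml).

```r
library(xml2)

# Tiny stand-in for the KGML file; contents are illustrative only
doc <- read_xml('<pathway name="path:hsa05224">
  <entry id="2" type="map"><graphics name="TITLE:Breast cancer"/></entry>
  <entry id="6" type="gene"><graphics name="ESR1"/></entry>
  <relation entry1="6" entry2="2" type="maplink"><subtype name="compound" value="9"/></relation>
</pathway>')

# Name of the root element
root_name <- xml_name(doc)
root_name

# How many children of each kind sit directly under the root
child_counts <- table(xml_name(xml_children(doc)))
child_counts
```

On the real file, the same two calls should show one pathway root with entry and relation children.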

Loading XML file

library(xml2)
xml <- read_xml('hsa05224.xml')

Extracting all `entry` nodes

entries <- xml_find_all(xml, "/pathway//entry")

Here, the leading / in the path refers to the root of the document. A single / steps from a node to its direct children, while // matches descendants at any depth below the node.

entries is now a node set (class xml_nodeset) of 130 nodes, which can be indexed like a list. Each node can have multiple attributes and multiple children. Attribute names are unique within a node, whereas several children can share the same name.
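To make the two XPath separators concrete, here is a small sketch on a made-up document. The group wrapper element is hypothetical and does not occur in the KGML sample; it is only there so that one entry is nested more deeply than the other.

```r
library(xml2)

# Illustrative document: one entry sits directly under the root,
# another is nested inside a hypothetical <group> wrapper
doc <- read_xml('<pathway>
  <entry id="1"/>
  <group><entry id="2"/></group>
</pathway>')

# "/" steps to direct children only
direct <- xml_find_all(doc, "/pathway/entry")
direct_ids <- xml_attr(direct, "id")
direct_ids

# "//" matches descendants at any depth
anywhere <- xml_find_all(doc, "/pathway//entry")
anywhere_ids <- xml_attr(anywhere, "id")
anywhere_ids

# A node set has a length and can be indexed with [[ ]]
length(anywhere)
```

In the KGML sample every entry is a direct child of pathway, so both expressions should select the same nodes there.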

Getting information from single nodes

> # print entire node
> entries[[1]]
{xml_node}
<entry id="2" name="path:hsa05224" type="map" link="https://www.kegg.jp/dbget-bin/www_bget?hsa05224">
[1] <graphics name="TITLE:Breast cancer" fgcolor="#000000" bgcolor="#FFFFFF" type="roundrectangle" x="123" y="58" width="165" height="25"/>
 
> # get node path
> xml_path(entries[[1]])
[1] "/pathway/entry[1]"
 
> # get vector of all attributes
> xml_attrs(entries[[1]])
                                               id                                              name                                              type 
                                              "2"                                   "path:hsa05224"                                             "map" 
                                             link 
"https://www.kegg.jp/dbget-bin/www_bget?hsa05224" 
 
> # get specific attributes
> xml_attr(entries[[1]], 'name')
[1] "path:hsa05224"

> xml_attr(entries[[1]], 'link')
[1] "https://www.kegg.jp/dbget-bin/www_bget?hsa05224"
 
> # get node name
> xml_name(entries[[1]])
[1] "entry"
 
> # get all sub-nodes inside node
> kids <- xml_children(entries[[1]])

In the last step we extracted all child nodes, on each of which we can perform the same operations again, e.g.:

> xml_path(kids[[1]])
[1] "/pathway/entry[1]/graphics"

> xml_attrs(kids[[1]])
                 name               fgcolor               bgcolor                  type                     x                     y 
"TITLE:Breast cancer"             "#000000"             "#FFFFFF"      "roundrectangle"                 "123"                  "58" 
                width                height 
                "165"                  "25" 

> xml_name(kids[[1]])
[1] "graphics"

Manually extracting data from these documents mostly involves a lot of for() and/or lapply() calls.
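For flat attribute data, some of those loops can be skipped, because the xml2 accessors also accept a whole node set. The sketch below builds a data frame with one row per entry, without calling xmlToDataFrame(). The inline document is a cut-down stand-in for the KGML file, and the column names (label, x, y) are my own choice, not anything defined by KEGG.

```r
library(xml2)

# Cut-down stand-in for the KGML file (attributes shortened for brevity)
doc <- read_xml('<pathway name="path:hsa05224">
  <entry id="2" name="path:hsa05224" type="map">
    <graphics name="TITLE:Breast cancer" x="123" y="58"/>
  </entry>
  <entry id="9" name="cpd:C00951" type="compound">
    <graphics name="C00951" x="237" y="190"/>
  </entry>
</pathway>')

entries <- xml_find_all(doc, "/pathway/entry")

# xml_find_first() returns exactly one result per entry (a missing node
# if an entry has no <graphics> child), so the columns stay aligned
graphics <- xml_find_first(entries, "./graphics")

entry_df <- data.frame(
  id    = xml_attr(entries, "id"),
  name  = xml_attr(entries, "name"),
  type  = xml_attr(entries, "type"),
  label = xml_attr(graphics, "name"),
  x     = as.numeric(xml_attr(graphics, "x")),
  y     = as.numeric(xml_attr(graphics, "y")),
  stringsAsFactors = FALSE
)
entry_df
```

Only the first graphics child of each entry is used here. If an entry carries several graphics nodes, an lapply() over xml_children() is still needed to keep them all.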
