How to extract the second attribute from an XML file line in R

Time:10-08

I need to extract certain attributes from an XML file in which several nodes share the same name but carry a different number of attributes. The file is located here:

https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1

And here is a small portion of the file itself:

<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
  <boardgame objectid="13">
     <yearpublished>1995</yearpublished>
     <minplayers>3</minplayers>
     <maxplayers>4</maxplayers>
     <playingtime>120</playingtime>
     <minplaytime>60</minplaytime>
     <maxplaytime>120</maxplaytime>
     <age>10</age>
     <name sortindex="1">Catan</name>
     <name primary="true" sortindex="1">CATAN</name>
     <name sortindex="1">Catan (Колонизаторы)</name>
     <name sortindex="1">Catan telepesei</name>
     <name sortindex="1">Catan: Das Spiel</name>
     <name sortindex="1">Catan: Die Bordspel</name>
     <name sortindex="1">Catan: El Juego</name>
     <name sortindex="1">Catan: Gra planszowa</name>
     <name sortindex="1">Catan: Il Gioco</name>
     <name sortindex="1">Catan: Landnemarnir</name>

I want to extract only the value of "sortindex" from each node named "name". I have tried the following, but for the second "name" node it returns both the primary value ("true") and the sortindex value. I've tried many different approaches, including xmlGetAttr, and I can't get it to work. How do I get this simple operation to work?

library(xml2)
library(XML)

# url holds the address given above
data <- read_xml(url)
xmlfile <- xmlParse(data)
xmltop <- xmlRoot(xmlfile)
xmlSApply(getNodeSet(xmltop, '//name[@sortindex]'), xmlAttrs)

> xmlSApply(getNodeSet(xmltop, '//name[@primary]'), xmlAttrs)
             [,1]  
  primary   "true"
  sortindex "1"   

CodePudding user response:

It sounds like you want to include any name node, even if it doesn't have the attribute. If so, you can try the following:

library(xml2)
library(XML)

data <- read_xml('https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1')
xmlfile <- xmlParse(data)
xmltop <- xmlRoot(xmlfile)

# Return the named attribute of node x, or NA if the node does not have it
getAttr <- function(x, attrName) {
  attrs <- xmlAttrs(x)
  if (attrName %in% names(attrs)) {
    attrs[[attrName]]
  } else {
    NA
  }
}

# One value per name node; NA where the attribute is absent
xmlSApply(getNodeSet(xmltop, '//name'), function(x) getAttr(x, "sortindex"))

xmlSApply(getNodeSet(xmltop, '//name'), function(x) getAttr(x, "primary"))
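
As an alternative, here is a minimal sketch that stays within xml2 instead of handing the document to the XML package. xml_find_all() selects the nodes and xml_attr() returns one value per node, with NA where the attribute is missing, so no helper function is needed. The inline sample is copied from the excerpt in the question and only stands in for the live API response; that the real response has the same shape is an assumption.

```r
library(xml2)

# Inline sample taken from the excerpt in the question (assumption: the
# live API response is structured the same way).
doc <- read_xml('<boardgames>
  <boardgame objectid="13">
    <name sortindex="1">Catan</name>
    <name primary="true" sortindex="1">CATAN</name>
    <name sortindex="1">Catan: Das Spiel</name>
  </boardgame>
</boardgames>')

# Select every name node, whether or not it has the attribute of interest
name_nodes <- xml_find_all(doc, "//name")

# xml_attr() is vectorised over the node set and yields NA for nodes
# that lack the requested attribute
sortindex <- xml_attr(name_nodes, "sortindex")
primary   <- xml_attr(name_nodes, "primary")

print(sortindex)
print(primary)
```

Because both vectors have one entry per name node, they line up with xml_text(name_nodes) if the names themselves are needed alongside the attributes.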

If you don't want to include nodes without the attribute, then you can do something very similar:

library(xml2)
library(XML)

data <- read_xml('https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1')
xmlfile <- xmlParse(data)
xmltop <- xmlRoot(xmlfile)

xmlSApply(getNodeSet(xmltop, '//name[@sortindex]'), function(x) xmlAttrs(x)[['sortindex']])

xmlSApply(getNodeSet(xmltop, '//name[@primary]'), function(x) xmlAttrs(x)[['primary']])
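
The same filtering can also be sketched with xmlGetAttr, which the question mentions trying. The version below is an assumption-laden illustration rather than the approach above: it parses a small inline sample (copied from the question's excerpt, with the hypothetical variable name sample_xml) and lets xpathSApply pass each matching node to xmlGetAttr along with the attribute name.

```r
library(XML)

# Inline sample taken from the excerpt in the question; sample_xml is a
# name introduced here for illustration only.
sample_xml <- '<boardgames>
  <boardgame objectid="13">
    <name sortindex="1">Catan</name>
    <name primary="true" sortindex="1">CATAN</name>
    <name sortindex="1">Catan: Das Spiel</name>
  </boardgame>
</boardgames>'

doc <- xmlParse(sample_xml, asText = TRUE)

# The XPath predicate keeps only nodes that carry the attribute;
# xmlGetAttr then pulls out just that one attribute's value
sortindex <- xpathSApply(doc, "//name[@sortindex]", xmlGetAttr, "sortindex")
primary   <- xpathSApply(doc, "//name[@primary]", xmlGetAttr, "primary")

print(sortindex)
print(primary)
```

Unlike xmlAttrs, which returns every attribute of a node, xmlGetAttr returns only the one requested, which is why the second name node no longer reports primary together with sortindex.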