Home > front end >  R xml2 : How to query only corresponding xml nodes
R xml2 : How to query only corresponding xml nodes

Time:10-23

I'm trying to read and transform many XML files into R data frames (or preferably Tibbles).

All R packages I've tried, unfortunately (XML, flatxml, xmlconvert) failed when I tried to convert the files using built-in functions (e.g. xmltodataframe from the XML Package and xml_to_df from the xmlconvert package), so I have to do it manually with XML2.

Here is my question with a small working example:

# Minimal Working Example

library(tidyverse)
library(xml2)


interimxml <- read_xml("<Subdivision>
              <Name>Charles</Name>
              <Salary>100</Salary>
              <Name>Laura</Name>
              <Name>Steve</Name>
              <Salary>200</Salary>
              </Subdivision>")

names  <- xml_text(xml_find_all(interimxml ,"//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml ,"//Subdivision/Salary"))

names
salary

# combine in to tibble (doesn't work because of inequal vector lengths)
result <- tibble(names=names, 
                 salary = salary)
result

rbind(names, salary)

From the (made up) XML file you can see that Charles earns 100 dollars, Laura earns nothing ( because of the missing entry, here is the problem) and Steve earns 200 dollars.

What I want xml2 do to is, when querying names and salary nodes is to return an "NA" (or zero which would also be okay), when it finds a name but no corresponding salary entry, so that I would end up a nice table like this:

Name Salary
Charles 100
Laura NA
Steve 200

I know that I could modify the "xpath" to only pick up the last value (for Steve), which wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.

[ I'm aware that Salary Numbers are pulled as character values from the xml file. I would mutate(across(salary, as.double) over columns afterwards.]

Any help is highly appreciated. Thank you very much in advance.

CodePudding user response:

You need to be a bit more careful to match up the names and salaries. Basically first find all the <Name> nodes, then check only if their next sibling is a <Salary> node. If not, then return NA.

nameNodes <- xml_find_all(interimxml ,"//Subdivision/Name")
names <- xml_text(nameNodes)
salary <- map_chr(nameNodes, ~xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))

tibble::tibble(names, salary)
#   names   salary
#   <chr>   <chr> 
# 1 Charles 100   
# 2 Laura   NA    
# 3 Steve   200  
  • Related