Filter by extract values of attributes in an XML file inR-CodePudding

I am downloading a lot of XML files from which I want to extract some values (the same values in each file). The files are all parsed in the same fashion (I shortened the document a bit):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<bilans version="1.0" xmlns="fr:inpi:odrncs:bilansSaisisXML">
<bilan>
<identite>
<siren>03827494129</siren>
<code>5610A</code>
<code_motif>00</code_motif>
<denomination><![CDATA[blabla]]></denomination>
<adresse><![CDATA[13100 Aix-en-Provence]]></adresse>
</identite>
<detail>
<page numero="01">
<liasse code="AH" m1="000000000550366" m3="000000000550366" m4="000000000550366"/>
<liasse code="CF" m1="000000000061757" m3="000000000061757" m4="000000000063065"/>
</page>
<page numero="02">
<liasse code="DA" m1="000000000007622" m2="000000000007622"/>
<liasse code="DL" m1="000000000322767" m2="000000000317800"/>
<liasse code="DV" m1="000000000015103"/>
<liasse code="DX" m1="000000000090125" m2="000000000047110"/>
</page>
</detail>
</bilan>
</bilans>

The values I am interested in are in identite, but also and mostly in the attributes parameters values (if I got XML terminology right). For example, I would like to extract the "m3" value where liasse code = "CF", which means I want both to filter by attribute and extract values also contained in this specific attribute.

The huge constraint is that I am extracting a lot of those xml files (which actually represent firms annual balance sheets), so I'm not sure it would be memory-friendly to extract all of the xml document in an R file, and then filter.

The ressources I browsed focused on extracting values of specific attributes, which is a common operation, but filtering by attribute and extracting the same attribute values is something I did not found with R.

CodePudding user response：

Just an XPath then xml_attr:

xmlfile %>%
  xml_ns_strip() %>%
  xml_find_all(xpath = "//liasse[@code='CF']") %>%
  xml_attr("m3")

CodePudding user response：

Consider iterating through all <bilan> nodes and extract underlying descendants, specifically <identite> nodes and the specific liasse attribute. Below shows how to parse elements under a default namespace as XML contains: xmlns="fr:inpi:odrncs:bilansSaisisXML".

library(xml2)
library(dplyr)

# LOAD XML
doc <- xml2::read_xml("Input.xml")

# USE TEMP fr PREFIX FOR DEFAULT NAMESPACE
nmsp <- c(fr = "fr:inpi:odrncs:bilansSaisisXML")

# RETRIEVE ALL bilan NODES
bilans <- xml_find_all(doc, "//fr:bilan", ns=nmsp)

# ITERATE THROUGH ALL bilan DESCENDANTS
df_list <- lapply(bilans, function(bilan) {
  # RETRIEVE identite NODES
  ch_recs <- xml_find_all(bilan, "fr:identite/*", ns=nmsp)
  
  # BIND NODE NAMES AND TEXT TO DATA FRAME AND ADD m3 COLUMN
  data.frame(rbind(setNames(
    c(xml2::xml_text(ch_recs)), 
    c(xml2::xml_name(ch_recs))
  ))) %>% mutate(
    m3 = xml_text(xml_find_first(
      bilan, "fr:detail/fr:page/fr:liasse[@code='CF']/@m3", ns=nmsp
    ))
  )
})

# BIND ALL LIST OF DFs TO SINGLE DF
bilan_df <- dplyr::bind_rows(df_list)

Output

str(bilan_df)
# 'data.frame': 1 obs. of  6 variables:
#  $ siren       : chr "03827494129"
#  $ code        : chr "5610A"
#  $ code_motif  : chr "00"
#  $ denomination: chr "blabla"
#  $ adresse     : chr "13100 Aix-en-Provence"
#  $ m3          : chr "000000000061757"

bilan_df
#         siren  code code_motif denomination               adresse              m3
# 1 03827494129 5610A         00       blabla 13100 Aix-en-Provence 000000000061757

Above will parse all <bilan> nodes in a single XML document. Should you need to iterate through many XML documents, run above in a function that receive a file name as input parameter. Then iteratively call function across the XML files. You can do a final file-level bind_rows:

parse_bilan_data <- function(xml_file) {
    # LOAD XML
    doc <- xml2::read_xml(xml_file)

    ...

    # BIND ALL LIST OF DFs TO SINGLE DF
    bilan_df <- dplyr::bind_rows(df_list) %>% mutate(source=xml_file)
}

xml_files <- list.files(path="/path/to/XML/files", pattern=".xml")

all_bilan_df <- dplyr::bind_rows(
    lapply(xml_files, parse_bilan_data)
)