I am downloading a lot of XML files from which I want to extract some values (the same values in each file). The files all share the same structure (I shortened the document a bit):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<bilans version="1.0" xmlns="fr:inpi:odrncs:bilansSaisisXML">
<bilan>
<identite>
<siren>03827494129</siren>
<code>5610A</code>
<code_motif>00</code_motif>
<denomination><![CDATA[blabla]]></denomination>
<adresse><![CDATA[13100 Aix-en-Provence]]></adresse>
</identite>
<detail>
<page numero="01">
<liasse code="AH" m1="000000000550366" m3="000000000550366" m4="000000000550366"/>
<liasse code="CF" m1="000000000061757" m3="000000000061757" m4="000000000063065"/>
</page>
<page numero="02">
<liasse code="DA" m1="000000000007622" m2="000000000007622"/>
<liasse code="DL" m1="000000000322767" m2="000000000317800"/>
<liasse code="DV" m1="000000000015103"/>
<liasse code="DX" m1="000000000090125" m2="000000000047110"/>
</page>
</detail>
</bilan>
</bilans>
The values I am interested in are in identite, but also, and mostly, in the attribute values (if I got the XML terminology right). For example, I would like to extract the "m3" value where liasse code = "CF", which means I want both to filter by one attribute and to extract a value held in another attribute of the same element.
The big constraint is that I am processing a lot of these XML files (each one represents a firm's annual balance sheet), so I'm not sure it would be memory-friendly to load every XML document entirely into R and only then filter.
The resources I browsed focus on extracting the values of specific attributes, which is a common operation, but filtering by attribute and then extracting attribute values from the matching elements is something I did not find for R.
CodePudding user response:
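Use an XPath expression to select the matching node, then xml_attr to read the attribute:
library(xml2)
library(magrittr)  # provides %>%
# xmlfile is assumed to be a document already loaded with read_xml()
xmlfile %>%
  xml_ns_strip() %>%
  xml_find_all(xpath = "//liasse[@code='CF']") %>%
  xml_attr("m3")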
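If you would rather not modify the document with xml_ns_strip(), a namespace-aware variant is possible. The following is a minimal sketch, assuming the real files follow the layout shown in the question; the inline sample and the variable names (xml_txt, cf_node, m3_value, m3_numeric) are made up for illustration. It relies on xml2 exposing the default namespace under the generated prefix d1.

```r
library(xml2)

# Inline sample reduced from the question's document (illustrative only)
xml_txt <- '<bilans version="1.0" xmlns="fr:inpi:odrncs:bilansSaisisXML">
  <bilan>
    <detail>
      <page numero="01">
        <liasse code="AH" m1="000000000550366" m3="000000000550366"/>
        <liasse code="CF" m1="000000000061757" m3="000000000061757" m4="000000000063065"/>
      </page>
    </detail>
  </bilan>
</bilans>'

doc <- read_xml(xml_txt)

# Filter by the code attribute, using the d1 prefix for the default namespace
cf_node <- xml_find_first(doc, "//d1:liasse[@code='CF']", xml_ns(doc))

# Read the m3 attribute of the matching node (a character string)
m3_value <- xml_attr(cf_node, "m3")

# Optional: drop the zero padding by converting to a number
m3_numeric <- as.numeric(m3_value)
```

xml_attr() returns character values, so the conversion at the end is only needed if you want to compute with the amounts.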
CodePudding user response:
Consider iterating through all <bilan> nodes and extracting their descendants, specifically the children of <identite> and the specific liasse attribute. The code below shows how to parse elements under a default namespace, since the XML contains xmlns="fr:inpi:odrncs:bilansSaisisXML".
library(xml2)
library(dplyr)

# LOAD XML
doc <- xml2::read_xml("Input.xml")

# USE TEMP fr PREFIX FOR DEFAULT NAMESPACE
nmsp <- c(fr = "fr:inpi:odrncs:bilansSaisisXML")

# RETRIEVE ALL bilan NODES
bilans <- xml_find_all(doc, "//fr:bilan", ns=nmsp)

# ITERATE THROUGH ALL bilan DESCENDANTS
df_list <- lapply(bilans, function(bilan) {
  # RETRIEVE identite NODES
  ch_recs <- xml_find_all(bilan, "fr:identite/*", ns=nmsp)

  # BIND NODE NAMES AND TEXT TO DATA FRAME AND ADD m3 COLUMN
  data.frame(rbind(setNames(
    c(xml2::xml_text(ch_recs)),
    c(xml2::xml_name(ch_recs))
  ))) %>% mutate(
    m3 = xml_text(xml_find_first(
      bilan, "fr:detail/fr:page/fr:liasse[@code='CF']/@m3", ns=nmsp
    ))
  )
})

# BIND ALL LIST OF DFs TO SINGLE DF
bilan_df <- dplyr::bind_rows(df_list)
Output
str(bilan_df)
# 'data.frame': 1 obs. of 6 variables:
# $ siren : chr "03827494129"
# $ code : chr "5610A"
# $ code_motif : chr "00"
# $ denomination: chr "blabla"
# $ adresse : chr "13100 Aix-en-Provence"
# $ m3 : chr "000000000061757"
bilan_df
# siren code code_motif denomination adresse m3
# 1 03827494129 5610A 00 blabla 13100 Aix-en-Provence 000000000061757
The code above parses all <bilan> nodes in a single XML document. Should you need to process many XML documents, wrap it in a function that receives a file name as its input parameter, then call that function across the XML files and finish with a file-level bind_rows:
parse_bilan_data <- function(xml_file) {
  # LOAD XML
  doc <- xml2::read_xml(xml_file)
  ...
  # BIND ALL LIST OF DFs TO SINGLE DF AND RECORD THE SOURCE FILE
  dplyr::bind_rows(df_list) %>% mutate(source=xml_file)
}
# full.names=TRUE RETURNS FULL PATHS SO read_xml WORKS FROM ANY WORKING DIRECTORY
xml_files <- list.files(path="/path/to/XML/files", pattern="\\.xml$", full.names=TRUE)
all_bilan_df <- dplyr::bind_rows(
  lapply(xml_files, parse_bilan_data)
)
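On the memory concern from the question, the per-file approach can be tried end to end with the sketch below. It is an illustration under stated assumptions, not the answer's exact code: it assumes one <bilan> per file, keeps only siren and the CF m3 value, and the helper names (extract_cf_m3, make_sample) and the two generated sample files are hypothetical. Each parsed document is local to the function call, so only the small extracted rows accumulate across files.

```r
library(xml2)

# Hypothetical helper: parse one file and keep only the values of interest
extract_cf_m3 <- function(xml_file) {
  doc <- read_xml(xml_file)
  ns <- c(fr = "fr:inpi:odrncs:bilansSaisisXML")
  data.frame(
    source = basename(xml_file),
    siren  = xml_text(xml_find_first(doc, "//fr:identite/fr:siren", ns)),
    # xml_attr() yields NA if the file has no liasse with code "CF"
    m3     = xml_attr(xml_find_first(doc, "//fr:liasse[@code='CF']", ns), "m3"),
    stringsAsFactors = FALSE
  )
}

# Build two tiny sample files in a temporary directory (demo data only)
make_sample <- function(siren, m3) {
  sprintf(paste0(
    '<bilans xmlns="fr:inpi:odrncs:bilansSaisisXML"><bilan>',
    '<identite><siren>%s</siren></identite>',
    '<detail><page numero="01"><liasse code="CF" m1="1" m3="%s"/></page></detail>',
    '</bilan></bilans>'
  ), siren, m3)
}
demo_dir <- file.path(tempdir(), "bilans_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(make_sample("111", "000000000000010"), file.path(demo_dir, "a.xml"))
writeLines(make_sample("222", "000000000000020"), file.path(demo_dir, "b.xml"))

# Process the files one at a time and stack the one-row results
xml_files <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
result <- do.call(rbind, lapply(xml_files, extract_cf_m3))
```

For files that hold several <bilan> nodes, the lapply over bilans shown earlier would be the safer choice, since xml_find_first only returns the first match in each document.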