How can I access and read multiple XML format files within a folder using R?

Time:12-13

I have a local folder containing 64 individual EVENTLOGSTATE files in XML format that I'm trying to read into R. I can access the folder and list all the files in it, but when I try to read the files with xmlParse from library(XML), I get an error saying that the XML content does not seem to be XML.

For reference, here are my list.files line, my xmlParse line and the returned error, followed by an example of the file names in the folder and of the data in each file.

list.files(path = "C:\\Users\\OneDrive\\Documents\\XML") #pulls list of file names within the XML folder

xmlParse(list.files(path = "C:\\Users\\OneDrive\\Documents\\XML"))
> xmlParse(list.files(path = "C:\\Users\\OneDrive\\Documents\\XML"))
Error: XML content does not seem to be XML: 'f5e450.eventLogState
EventLog-0e6f76b3-12bc-4d4a-aab6-a97600f5f46b.eventLogState
EventLog-11fbd569-4fd5-4bbe-89aa-a9df01378901.eventLogState
EventLog-151c1acc-0062-4f97-989a-a9d7015233f1.eventLogState

Each EventLog file contains data about a recorded session. I need to pull out the recording start and end times, build a data frame from them, and then calculate the total recording length and create visuals. The files are all separate, and each one contains information in this format:

<?xml version="1.0" encoding="utf-8"?>
<EventLogState xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Panopto.Recorder">
  <AttemptCount>5</AttemptCount>
  <ErrorInfo>Unable to generate event logs</ErrorInfo>
  <FileInfo i:nil="true" />
  <PanoptoSiteFQDN>hosted.panopto.com</PanoptoSiteFQDN>
  <RecordingEndTime>2018-10-11T12:13:38.1115286-04:00</RecordingEndTime>
  <RecordingId>0e6f76b3-12bc-4d4a-aab6-a97600f5f46b</RecordingId>
  <RecordingStartTime>2018-10-11T11:04:04.9321231-04:00</RecordingStartTime>
  <SessionId>c3c84fee-836b-4d30-8115-a97600f85490</SessionId>
  <Status>Error</Status>
</EventLogState>

I tried this loop solution, but it just returns a 0 x 0 tibble:

library(xml2)
library(dplyr)
files <- list.files(path = "C:\\Users\\OneDrive\\Documents\\XML")
dfs <-lapply(files, function(files) {
  page <- read_xml(file)
  id <- xml_find_first(out, "//EventLogState") %>% xml_attr("xmlns:i") 
  end.time <- xml_find_first(out, ".//RecordingEndTime") %>% xml_text()
  start.time <- xml_find_first(out, ".//RecordingStartTime") %>% xml_text()
  data.frame(id, end.time, start.time)
})

#combine all results into 1 data frame
answer <- bind_rows(dfs)
answer

Any ideas on how to get the xmlParse line to recognize each individual file and pull in a combined text version to work with?

CodePudding user response:

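That was a good start. These files declare a default namespace, which does throw in a curve ball: XPath queries such as ".//RecordingEndTime" match nothing until the namespace is dealt with. The easiest way to handle the namespace is to strip it out.
Also, make sure the xml_find_first() calls reference the object returned by read_xml() (page), not an undefined name such as out.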

This should now work for you:

library(xml2)
library(dplyr)
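# full.names = TRUE returns complete paths, so read_xml() can find each file
# regardless of the current working directory
files <- list.files(path = "C:\\Users\\OneDrive\\Documents\\XML", full.names = TRUE)
dfs <- lapply(files, function(file) {
   page <- read_xml(file)
   # To check for a namespace, run:
   #    xml_ns(page)
   # It is easier to work with the file if the namespace is removed
   # (xml_ns_strip() modifies page in place)
   xml_ns_strip(page)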
   id <- xml_find_first(page, ".//RecordingId") %>% xml_text()
   end.time <- xml_find_first(page, ".//RecordingEndTime") %>% xml_text()
   start.time <- xml_find_first(page, ".//RecordingStartTime") %>% xml_text()
   data.frame(id, end.time, start.time)
})

#combine all results into 1 data frame
answer <- bind_rows(dfs)
answer

The above code assumes only one "EventLogState" node per file.
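
If you would rather keep the namespace than strip it, a minimal sketch of the alternative is to pass the namespace map from xml_ns() to each XPath query. xml2 assigns the default namespace a generated prefix (d1 here), so the element names need that prefix. The sample_xml string is a trimmed copy of the file shown in the question and is used only for illustration:

```r
library(xml2)

# Trimmed copy of the sample file from the question (illustration only)
sample_xml <- '<?xml version="1.0" encoding="utf-8"?>
<EventLogState xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Panopto.Recorder">
  <RecordingEndTime>2018-10-11T12:13:38.1115286-04:00</RecordingEndTime>
  <RecordingId>0e6f76b3-12bc-4d4a-aab6-a97600f5f46b</RecordingId>
  <RecordingStartTime>2018-10-11T11:04:04.9321231-04:00</RecordingStartTime>
</EventLogState>'

page <- read_xml(sample_xml)
ns <- xml_ns(page)  # maps the default namespace to the prefix "d1"

# Without a prefix the query matches nothing, so xml_text() returns NA
no_ns_id <- xml_text(xml_find_first(page, ".//RecordingId"))

# With the "d1" prefix and the namespace map, the nodes are found
id <- xml_text(xml_find_first(page, ".//d1:RecordingId", ns))
end.time <- xml_text(xml_find_first(page, ".//d1:RecordingEndTime", ns))
start.time <- xml_text(xml_find_first(page, ".//d1:RecordingStartTime", ns))
```

The unprefixed query returning NA is also why the attempt in the question produced no usable data. Stripping the namespace is shorter; keeping it leaves the parsed document unmodified.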

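For the recording-length calculation mentioned in the question, one possible sketch converts the timestamp strings to POSIXct and subtracts them. It assumes every timestamp has the shape shown in the sample file (fractional seconds plus a UTC offset written like -04:00). The colon is removed from the offset first, because the %z format is documented for the -0400 form. parse_panopto_time is a hypothetical helper name, and the one-row answer data frame here is a stand-in for the combined result:

```r
# Hypothetical helper: parse timestamps such as "2018-10-11T12:13:38.1115286-04:00"
parse_panopto_time <- function(x) {
  # Turn the "-04:00" offset into "-0400" so that %z can read it
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OS%z", tz = "UTC")
}

# Stand-in for the combined data frame, using the values from the sample file
answer <- data.frame(
  id = "0e6f76b3-12bc-4d4a-aab6-a97600f5f46b",
  end.time = "2018-10-11T12:13:38.1115286-04:00",
  start.time = "2018-10-11T11:04:04.9321231-04:00"
)

answer$start <- parse_panopto_time(answer$start.time)
answer$end <- parse_panopto_time(answer$end.time)

# Recording length in minutes
answer$length.mins <- as.numeric(difftime(answer$end, answer$start, units = "mins"))
```

Because both columns become real date-times, the same data frame can then be summed, filtered or plotted. Files whose timestamps do not match the assumed format would come back as NA and are worth checking with is.na().
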