Home > Back-end >  Converting XML to a dataframe R and extracting value from within
Converting XML to a dataframe R and extracting value from within

Time:02-17

I have the following XML which I am trying to convert into a data frame in R or a table within SQL:

<Root>
<Product>
    <Test_ID value="1" />
    <Effective_Date value="2022-01-01" />
    <Membership value="Yes" />
    <Request>
      <Request_ID value="1" />
      <Request_type value="Simple" />
    </Request>
    <Request>
      <Request_ID value="2" />
      <Request_type value="Complex" />
    </Request>
</Product>
</Root>

I've tried using XMLtoDataFrame ("xml" library) but this just creates a table without any values. Would anyone be able to help with this please? Open to a SQL solution.

Edit: I'm looking for my table to be in the following format table

CodePudding user response:

Here's an effort.

Assumptions:

  • Root is always there, and always just one
  • Product can contain zero or more, so we'll iterate smartly and combine
  • I do not assume that $Product$Requests are always perfectly aligned; it can be simplified (hard-coded fewer lines of code) if this is a guaranteed format.

Some helper functions:

func1 <- function(z) if (is.null(names(z))) attr(z, "value") else lapply(z, func1)
merge.list <- function(A, B) {
  # normalize lengths, just in case, since I think you have more than one $Product
  A <- lapply(A, `length<-`, max(lengths(A)))
  B <- lapply(B, `length<-`, max(lengths(B)))
  BnotA <- setdiff(names(B), names(A))
  AnotB <- setdiff(names(A), names(B))
  inboth <- intersect(names(A), names(B))
  A[BnotA] <- replicate(length(BnotA), rep(NA, max(lengths(A))), simplify = FALSE)
  A[AnotB] <- lapply(A[AnotB], function(z) c(z, rep(NA, max(lengths(B)))))
  A[inboth] <- Map(c, A[inboth], B[inboth])
  A
}

Code:

# library(xml2) # read_xml, as_list
L <- xml2::as_list(xml2::read_xml('
<Root>
<Product>
    <Test_ID value="1" />
    <Effective_Date value="2022-01-01" />
    <Membership value="Yes" />
    <Request>
      <Request_ID value="1" />
      <Request_type value="Simple" />
    </Request>
    <Request>
      <Request_ID value="2" />
      <Request_type value="Complex" />
    </Request>
</Product>
</Root>'))

out <- do.call(rbind.data.frame, lapply(func1(L$Root), function(pr) {
  as.data.frame(lapply(split(pr, names(pr)), function(Y) Reduce(merge.list, Y)))
}))
out
#           Effective_Date Membership Request.Request_ID Request.Request_type Test_ID
# Product.1     2022-01-01        Yes                  1               Simple       1
# Product.2     2022-01-01        Yes                  2              Complex       1

(It's a "clean" data.frame:)

str(out)
# 'data.frame': 2 obs. of  5 variables:
#  $ Effective_Date      : chr  "2022-01-01" "2022-01-01"
#  $ Membership          : chr  "Yes" "Yes"
#  $ Request.Request_ID  : chr  "1" "2"
#  $ Request.Request_type: chr  "Simple" "Complex"
#  $ Test_ID             : chr  "1" "1"
  • Related