How to iterate over a nested list in R and assign values to data.frame

Time:03-17

I am trying to parse all sitemaps in a sitemap index. I was able to create an object x that holds all three sitemaps from the index.

I can create a separate object for each nested XML document and then rbind() them together, but I believe a function would be easier. I tried writing a for loop and using sapply, but I get an error, since I am trying to pass a list of lists into sapply.

My aim is to take all of the xml_children and assign them to a data frame, since doing it my way for a list of 50 XML documents would be very tedious.

library(xml2)
library(dplyr)

# xml_to_dataframe() is the custom helper defined further down
sitemap_index <- read_xml("https://www.bodystore.com/sitemap_index.xml")

sitemap_urls <- xml_children(sitemap_index) %>% xml_to_dataframe() %>% rename(url = loc)
# x contains the parsed XML document for each URL in the sitemap index
x <- lapply(sitemap_urls$url, read_xml)

#creating an empty dataframe
all_sitemaps <- data.frame()

#saving each part of the list 
x1 <- x[[1]] %>% xml_children() %>% xml_to_dataframe()
x2 <- x[[2]] %>% xml_children() %>% xml_to_dataframe() 
x3 <- x[[3]] %>% xml_children() %>% xml_to_dataframe() 
all_sitemaps <- rbind(x1,x2,x3)

xml_to_dataframe is a custom function that parses xml into a dataframe

xml_to_dataframe <- function(nodeset){
  if(!inherits(nodeset, 'xml_nodeset')){
    stop('Input should be "xml_nodeset" class')
  }
  lst <- lapply(nodeset, function(x){
    tmp <- xml2::xml_text(xml2::xml_children(x))
    names(tmp) <- xml2::xml_name(xml2::xml_children(x))
    return(as.list(tmp))
  })
  result <- do.call(plyr::rbind.fill, lapply(lst, function(x)
    as.data.frame(x, stringsAsFactors = FALSE)))
  return(dplyr::as_tibble(result))
}

Thank you very much for your help.

CodePudding user response:

A combination of lapply, rbind, and do.call should work (all base functions):

# From URL to data.frame
fun_smu2df <- function(smu)
{
    xml_to_dataframe(
    xml_children(
    read_xml(
       smu
    )))
}

# Stack the list of data.frames
all_sitemaps <- do.call(rbind, lapply(sitemap_urls$url, fun_smu2df))
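
To make the lapply / do.call(rbind, ...) pattern concrete, here is a minimal sketch that runs without network access. The two in-memory sitemaps, the example.com URLs, and the helper name sitemap_to_df are made up for illustration; they are not part of the question. The sketch also assumes the XML has no default namespace (real sitemaps usually declare one, in which case the XPath would need a namespace prefix or a call to xml_ns_strip()).

```r
library(xml2)

# Two tiny in-memory sitemaps stand in for the downloaded documents
# (hypothetical data, no namespace declared on purpose).
sitemaps <- list(
  "<urlset><url><loc>https://example.com/a</loc><lastmod>2022-01-01</lastmod></url></urlset>",
  paste0(
    "<urlset>",
    "<url><loc>https://example.com/b</loc><lastmod>2022-02-01</lastmod></url>",
    "<url><loc>https://example.com/c</loc><lastmod>2022-03-01</lastmod></url>",
    "</urlset>"
  )
)

# Hypothetical helper: one sitemap (as literal XML text) -> one data.frame
sitemap_to_df <- function(txt) {
  urls <- xml_children(read_xml(txt))   # one node per <url> entry
  data.frame(
    loc     = xml_text(xml_find_first(urls, "./loc")),
    lastmod = xml_text(xml_find_first(urls, "./lastmod")),
    stringsAsFactors = FALSE
  )
}

# lapply() yields a list of data.frames; do.call(rbind, ...) stacks them
all_rows <- do.call(rbind, lapply(sitemaps, sitemap_to_df))
print(all_rows)
```

The same shape applies to the real case: swap the literal XML for the URLs in sitemap_urls$url and the helper for fun_smu2df.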

Of course, the actual code should take into account the cases when the reading fails, when the XML is malformed, when it cannot be represented as a data.frame, etc. In that case, a more robust version would be:

# From URL to data.frame
fun_smu2df <- function(smu)
{
    xmlp <- tryCatch(
       expr  = read_xml(smu),
       error = identity
    )
    if (inherits(xmlp, "error")) {
        warning("Reading URL \"", as.character(smu), "\" failed.\n  previous: ", xmlp$message)
        return (NULL)
    }

    xmlc <- tryCatch(
       expr  = xml_children(xmlp),
       error = identity
    )
    if (inherits(xmlc, "error")) {
        warning("Reading XML children from URL \"", as.character(smu), "\" failed.\n  previous: ", xmlc$message)
        return (NULL)
    }

    xdf <- tryCatch(
       expr  = xml_to_dataframe(xmlc),
       error = identity
    )
    if (inherits(xdf, "error")) {
        warning("Converting XML children from URL \"", as.character(smu), "\" to \"data.frame\" failed.\n  previous: ", xdf$message)
        return (NULL)
    }

    return (xdf)
}

# Stack the list of data.frames
list_sitemaps <- lapply(sitemap_urls$url, fun_smu2df)  # This should not bomb
dt_sitemaps   <- tryCatch(
   expr  = do.call(rbind, list_sitemaps),
   error = identity
)
if (inherits(dt_sitemaps, "error")) {
    warning("Stacking the \"data.frame\" objects failed.\n  previous: ", dt_sitemaps$message)
    dt_sitemaps <- NULL
} else if (is.null(dt_sitemaps)) {
    warning("No \"data.frame\" objects were retrieved from the specified URLs.")
}

I know, it's not pretty (well, I think it is pretty :-D), but it won't bite you when a single URL fails for whatever reason.
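
The error-handling idea in the robust version can be tried in isolation with base R only. This is a small sketch, not the code above: parse_one and safe_parse are hypothetical stand-ins for the read/parse steps, and the NA input simulates a URL that fails. It relies on two things: tryCatch(..., error = identity) returns the condition object instead of stopping, and rbind() on data frames ignores NULL entries.

```r
# Hypothetical parser that fails on one kind of input
parse_one <- function(x) {
  if (is.na(x)) stop("bad input")
  data.frame(value = x)
}

# Wrap the parser: on error, warn and return NULL instead of stopping
safe_parse <- function(x) {
  res <- tryCatch(parse_one(x), error = identity)
  if (inherits(res, "error")) {
    warning("Parsing failed: ", conditionMessage(res))
    return(NULL)
  }
  res
}

# The middle element fails, so the list is: data.frame, NULL, data.frame
parts <- lapply(c(1, NA, 3), safe_parse)

# The NULL entry is dropped while stacking, leaving two rows
stacked <- do.call(rbind, parts)
print(stacked)
```

One caveat worth keeping in mind: rbind() still fails if the surviving data frames have different columns, which is why the final do.call in the robust version is wrapped in its own tryCatch.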
