I am trying to parse all sitemaps in a sitemap index. I was able to create an object x that holds all three sitemaps from the index.
I can create a separate object for each nested XML document and then rbind() them together, but I believe a function would be easier. I tried writing a for loop and using sapply, but it returns an error because I am passing a list of lists into sapply.
My aim is to take all of the xml_children and put them into a single data frame; doing it my way for a list of 50 XML documents would be very daunting.
library(xml2)   # read_xml(), xml_children()
library(dplyr)  # %>%, rename()
sitemap_index <- read_xml("https://www.bodystore.com/sitemap_index.xml")
sitemap_urls <- xml_children(sitemap_index) %>% xml_to_dataframe() %>% rename(url = loc)
x contains the sitemap read from each URL in the sitemap index
x <- lapply(sitemap_urls$url, read_xml)
#creating an empty dataframe
all_sitemaps <- data.frame()
#saving each part of the list
x1 <- x[[1]] %>% xml_children() %>% xml_to_dataframe()
x2 <- x[[2]] %>% xml_children() %>% xml_to_dataframe()
x3 <- x[[3]] %>% xml_children() %>% xml_to_dataframe()
all_sitemaps <- rbind(x1,x2,x3)
xml_to_dataframe is a custom function that parses xml into a dataframe
xml_to_dataframe <- function(nodeset){
  # inherits() is safer than comparing class() directly, which can return several classes
  if(!inherits(nodeset, 'xml_nodeset')){
    stop('Input should be "xml_nodeset" class')
  }
  # One named list (child tag -> text) per node
  lst <- lapply(nodeset, function(x){
    tmp <- xml2::xml_text(xml2::xml_children(x))
    names(tmp) <- xml2::xml_name(xml2::xml_children(x))
    return(as.list(tmp))
  })
  # rbind.fill pads columns that are missing in some rows with NA
  result <- do.call(plyr::rbind.fill, lapply(lst, function(x)
    as.data.frame(x, stringsAsFactors = FALSE)))
  return(dplyr::as_tibble(result))
}
Thank you very much for your help.
CodePudding user response:
A combination of lapply, rbind, and do.call should work (all base functions):
# From URL to data.frame
fun_smu2df <- function(smu)
{
  xml_to_dataframe(
    xml_children(   # xml2 has no xml_read_children(); xml_children() is the intended call
      read_xml(
        smu
      )))
}
# Stack the list of data.frames
all_sitemaps <- do.call(rbind, lapply(sitemap_urls$url, fun_smu2df))
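To see the lapply / do.call(rbind, ...) pattern in isolation, here is a minimal sketch that runs without network access. It assumes the xml2 package is installed; the literal XML strings, the example.com URLs, and the helper names nodes_to_df and doc_to_df are hypothetical stand-ins for the real sitemaps and for xml_to_dataframe, and the simplified converter assumes every <url> entry has the same child tags.

```r
library(xml2)

# Hypothetical stand-ins for the downloaded sitemaps: literal XML strings
# (read_xml() accepts literal XML as well as a path or URL).
sitemap_docs <- c(
  "<urlset><url><loc>https://example.com/a</loc><lastmod>2021-01-01</lastmod></url></urlset>",
  "<urlset><url><loc>https://example.com/b</loc><lastmod>2021-01-02</lastmod></url><url><loc>https://example.com/c</loc><lastmod>2021-01-03</lastmod></url></urlset>"
)

# Simplified converter: one data.frame row per node, one column per child tag.
# Assumes all nodes share the same child tags, so plain rbind() is enough.
nodes_to_df <- function(nodeset) {
  rows <- lapply(nodeset, function(node) {
    kids <- xml_children(node)
    vals <- xml_text(kids)
    names(vals) <- xml_name(kids)
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# From one document to one data.frame
doc_to_df <- function(doc) nodes_to_df(xml_children(read_xml(doc)))

# Apply to every document, then stack the results
all_rows <- do.call(rbind, lapply(sitemap_docs, doc_to_df))
print(all_rows)
```

The same two-step shape (one function per document, then one do.call(rbind, ...) over the list) is what replaces the hand-written x1, x2, x3 objects from the question.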
Of course, the actual code should take into account the cases where reading fails, where the XML is malformed, where it cannot be represented as a data.frame, etc. In that case, a more robust version would be:
# From URL to data.frame; returns NULL (with a warning) if any step fails
fun_smu2df <- function(smu)
{
  # Step 1: download and parse the XML
  xmlp <- tryCatch(
    expr = read_xml(smu),
    error = identity
  )
  if (inherits(xmlp, "error")) {
    warning("Reading URL \"", as.character(smu), "\" failed.\n previous: ", xmlp$message)
    return(NULL)
  }
  # Step 2: extract the child nodes (xml_children() is the xml2 function)
  xmlc <- tryCatch(
    expr = xml_children(xmlp),
    error = identity
  )
  if (inherits(xmlc, "error")) {
    warning("Reading XML children from URL \"", as.character(smu), "\" failed.\n previous: ", xmlc$message)
    return(NULL)
  }
  # Step 3: convert the nodes to a data.frame
  xdf <- tryCatch(
    expr = xml_to_dataframe(xmlc),
    error = identity
  )
  if (inherits(xdf, "error")) {
    warning("Converting XML children from URL \"", as.character(smu), "\" to \"data.frame\" failed.\n previous: ", xdf$message)
    return(NULL)
  }
  return(xdf)
}
# Stack the list of data.frames
list_sitemaps <- lapply(sitemap_urls$url, fun_smu2df) # This should not bomb
dt_sitemaps <- tryCatch(
  expr = do.call(rbind, list_sitemaps),
  error = identity
)
if (inherits(dt_sitemaps, "error")) {
  warning("Stacking the \"data.frame\" objects failed.\n previous: ", dt_sitemaps$message)
  dt_sitemaps <- NULL
} else if (is.null(dt_sitemaps)) {
  warning("No \"data.frame\" objects were retrieved from the specified URLs.")
}
I know, it's not pretty (well, I think it's pretty :-D), but it won't bite you in the ass when one single URL fails because reasons.
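The reason returning NULL on failure is enough is that rbind skips NULL entries when stacking data.frames, and gives back NULL when every entry is NULL. A small base-R sketch of just that behaviour, where parse_one and safe_parse are hypothetical names and the "bad" input merely simulates a URL that fails:

```r
# Hypothetical parser: fails on one input to imitate a broken URL
parse_one <- function(id) {
  if (id == "bad") stop("simulated download failure")
  data.frame(id = id, n = nchar(id), stringsAsFactors = FALSE)
}

# Wrap the parser: on error, emit a warning and return NULL instead of stopping
safe_parse <- function(id) {
  tryCatch(
    parse_one(id),
    error = function(e) {
      warning("Parsing \"", id, "\" failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Warnings are suppressed here only to keep the demo output quiet
parts <- suppressWarnings(lapply(c("aa", "bad", "cccc"), safe_parse))

# The NULL from the failed input is dropped; the two good rows are stacked
stacked <- do.call(rbind, parts)

# If everything failed, the result is NULL rather than an error
nothing <- do.call(rbind, list(NULL, NULL))
```

Using a handler function instead of error = identity is only a shorter way to write the same check-and-return-NULL logic; the behaviour on failure is the same.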