Home > Mobile >  Can't unlist a XML file after parsing
Can't unlist a XML file after parsing

Time:09-23

I'm trying to get data from the open-database of fuel price in France. The data are available Attributes

CodePudding user response:

Tidying a nested list like this is always an annoying problem. My approach is to build a custom function that works on each element, and then use purrr::map() to tidy each element individually.

I've built a custom function below to get you started. It works on the "instantanee" data from the link you provided, since that's what downloaded fastest. The same principles (and maybe even the same code) should apply to the other data sets.

Here's some code to load the data for the first five gas stations:

data_list <- list(pdv = structure(list(adresse = list("RD 93 GRANDE RUE"), 
    ville = list("Camphin-en-Pévèle"), horaires = structure(list(
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"), 
    services = list(service = list("Station de gonflage"), service = list(
        "Laverie"), service = list("Lavage automatique"), service = list(
        "Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-21 13:38:39", valeur = "1.443"), 
    prix = structure(list(), nom = "E85", id = "3", maj = "2021-08-17 11:35:16", valeur = "0.659"), 
    prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-21 13:38:39", valeur = "1.526"), 
    prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-21 13:38:39", valeur = "1.607")), id = "59780003", latitude = "5059477.455", longitude = "325781.84717474", cp = "59780", pop = "R"), 
    pdv = structure(list(adresse = list("AIRE DE LACQ AUDEJOS SUD"), 
        ville = list("LACQ AUDEJOS SUD"), horaires = structure(list(
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "1", nom = "Lundi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "2", nom = "Mardi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "3", nom = "Mercredi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "4", nom = "Jeudi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "5", nom = "Vendredi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "6", nom = "Samedi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "7", nom = "Dimanche", ferme = "")), "`automate-24-24`" = ""), 
        services = list(service = list("Carburant additivé"), 
            service = list("Toilettes publiques"), service = list(
                "Bar"), service = list("Boutique alimentaire"), 
            service = list("Station de gonflage"), service = list(
                "Espace bébé"), service = list("Piste poids lourds")), 
        prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-23 00:01:00", valeur = "1.689"), 
        prix = structure(list(), nom = "GPLc", id = "4", maj = "2021-09-23 00:01:00", valeur = "0.969"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-23 00:01:00", valeur = "1.789"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-23 00:01:00", valeur = "1.899")), id = "64170012", latitude = "4342142.6", longitude = "-59899.6", cp = "64170", pop = "A"), 
    pdv = structure(list(adresse = list("52 Avenue Léo Lagrange"), 
        ville = list("THIERS"), horaires = structure(list(jour = structure(list(
            horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"), 
        services = list(service = list("DAB (Distributeur automatique de billets)"), 
            service = list("Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-07-01 18:00:00", valeur = "1.530"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-09 18:00:00", valeur = "1.654")), id = "63300003", latitude = "4584800", longitude = "353000", cp = "63300", pop = "R"), 
    pdv = structure(list(adresse = list("Avenue de Garossos"), 
        ville = list("Beauzelle"), services = list(service = list(
            "Boutique alimentaire"), service = list("Station de gonflage"), 
            service = list("Vente de gaz domestique (Butane, Propane)"), 
            service = list("Piste poids lourds"), service = list(
                "DAB (Distributeur automatique de billets)"), 
            service = list("Lavage automatique"), service = list(
                "Lavage manuel"), service = list("Vente de fioul domestique"), 
            service = list("Vente de pétrole lampant")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-19 06:17:34", valeur = "1.432"), 
        prix = structure(list(), nom = "E85", id = "3", maj = "2021-09-19 06:17:35", valeur = "0.649"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-19 06:17:35", valeur = "1.559"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-19 06:17:35", valeur = "1.639")), id = "31700007", latitude = "4366800", longitude = "136500", cp = "31700", pop = "R"), 
    pdv = structure(list(adresse = list("Avenue de Brommat"), 
        ville = list("Mur-de-Barrez"), services = list(service = list(
            "Carburant additivé"), service = list("DAB (Distributeur automatique de billets)")), 
        prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-22 14:43:59", valeur = "1.510"), 
        prix = structure(list(), nom = "SP95", id = "2", maj = "2021-09-22 14:44:00", valeur = "1.690"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-22 14:44:00", valeur = "1.740")), id = "12600002", latitude = "4484071", longitude = "266470", cp = "12600", pop = "R"))

What a mess.

Here's a function that, when applied to each element of the list, will return a tidy result:

# This function will be applied to each entry in the big list, extracting the 
# data you're interested in and returning it in a tidy data frame.
# I've showed you how to extract a few values to get you started.
# You will need to build the rest of this function by hand, based
# on the specific structure of the data.
parse_vals <-  function(x){
  
  # get the address for this gas station
  address <- pluck(x, "adresse", 1)
  
  # get the lat and longitude
  lat <- attr(x, "latitude")
  lon <- attr(x, "longitude")
  
  # get gas data in a data frame
  # note that for some gas stations there are several list items with the same 
  # name ("prix" in this case) so we need to index in the way done below--just
  # doing `x$prix` will return only the first entry named `prix`
  gas <- purrr::map_dfr(x[names(x) == "prix"], attributes)

  # put all of our results together
  tibble(address = address,
         lat = lat,
         lon = lon) %>%
    bind_cols(gas)
  
}

I'm using the standard tidyverse suite and the package xml2 to load the file. Which you can then use like this:

library(tidyverse)
library(xml2)

# Note this is how I loaded the full dataset: if you're using the definition of data_list I posted above using `dput()`, keep this commented out.
#data <- xml2::read_xml(filename)
#data_list <- xml2::as_list(data)[[1]]

data_list %>%
  head(5) %>%
  purrr::map_dfr(parse_vals)

And it should give you a nice output like this:

# A tibble: 17 x 7
   address                  lat         lon             nom    id    maj                 valeur
   <chr>                    <chr>       <chr>           <chr>  <chr> <chr>               <chr> 
 1 RD 93 GRANDE RUE         5059477.455 325781.84717474 Gazole 1     2021-09-21 13:38:39 1.443 
 2 RD 93 GRANDE RUE         5059477.455 325781.84717474 E85    3     2021-08-17 11:35:16 0.659 
 3 RD 93 GRANDE RUE         5059477.455 325781.84717474 E10    5     2021-09-21 13:38:39 1.526 
 4 RD 93 GRANDE RUE         5059477.455 325781.84717474 SP98   6     2021-09-21 13:38:39 1.607 
 5 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        Gazole 1     2021-09-23 00:01:00 1.689 
 6 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        GPLc   4     2021-09-23 00:01:00 0.969 
 7 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        E10    5     2021-09-23 00:01:00 1.789 
 8 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        SP98   6     2021-09-23 00:01:00 1.899 
 9 52 Avenue Léo Lagrange   4584800     353000          Gazole 1     2021-07-01 18:00:00 1.530 
10 52 Avenue Léo Lagrange   4584800     353000          E10    5     2021-09-09 18:00:00 1.654

If you want more data, you can inspect the structure of data_list and add to the function parse_vals().

Note that this data is poorly formed for R, since it returns lists that have many entries with the same name, like prix. As a result, if you just do x$prix you'll get only the first entry named prix. That's why I've indexed it using x[names(x) == "prix"]. You may need to use this trick again.

  •  Tags:  
  • r xml
  • Related