Home > front end >  Converting XML to dataframe
Converting XML to dataframe

Time:12-18

I want to convert a XML to a dataframe. I'm aware of XML::xmlToDataFrame, but it gives an error in my case. The XML can be found here: https://api.data.gov.hk/v1/historical-archive/get-file?url=https://resource.data.one.gov.hk/td/traffic-detectors/rawSpeedVol-all.xml&time=20211216-0513

Thanks for all answers!

CodePudding user response:

Since your XML file contains multiple nested children, XML::xmlToDataFrame was giving out error.

I've approached the problem using the naive method but it works! Here's what I've done:

The following code creates a dataframe with the tags inside `'.

library(xml2)
require(XML)

pg <- read_xml("https://s3-ap-southeast-1.amazonaws.com/historical-resource-archive/2021/12/16/https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Ftraffic-detectors%2FrawSpeedVol-all.xml/0513")

records <- xml_find_all(pg, "//lane")

nodenames<-xml_name(xml_children(records))
nodevalues<-trimws(xml_text(xml_children(records)))


lane_id <- nodevalues[seq(1, length(nodevalues), 6)]
speed <- nodevalues[seq(2, length(nodevalues), 6)]
occupancy <- nodevalues[seq(3, length(nodevalues), 6)]
volume <- nodevalues[seq(4, length(nodevalues), 6)]
s.d. <- nodevalues[seq(5, length(nodevalues), 6)]
valid <- nodevalues[seq(6, length(nodevalues), 6)]

df <- data.frame(lane_id, speed, occupancy, volume, s.d., valid)
head(df)

The df looks like this:

     lane_id speed occupancy volume s.d. valid
1   Fast Lane    70         0      0    0     Y
2 Middle Lane    76         6      3 11.1     Y
3   Slow Lane    70         6      0    0     Y
4   Fast Lane    82         1      1    0     Y
5 Middle Lane    63         3      1    0     Y
6   Slow Lane    79         2      1    0     Y

If you want to extract the data of <detectors>, you can use the following code:

################ Extract Detector Data #########
records2 <- xml_find_all(pg, "//detector")
vals2 <- trimws(xml_text(records2))

nodenames2 <-xml_name(xml_children(records2))
nodevalues2 <-trimws(xml_text(xml_children(records2)))

detector_id <- nodevalues2[seq(1, length(nodevalues2), 3)]
direction <- nodevalues2[seq(2, length(nodevalues2), 3)]
lanes <- nodevalues2[seq(3, length(nodevalues2), 3)]

df2 <- data.frame(detector_id, direction, lanes)
head(df2)

The df2 looks like this:

  detector_id  direction                                              lanes
1    AID01101 South East Fast Lane70000YMiddle Lane766311.1YSlow Lane70600Y
2    AID01102 North East    Fast Lane82110YMiddle Lane63310YSlow Lane79210Y
3    AID01103 South East   Fast Lane50000YMiddle Lane65210YSlow Lane192310Y
4    AID01104 North East                     Fast Lane50000YSlow Lane63110Y
5    AID01105 North East                     Fast Lane50100YSlow Lane53410Y
6    AID01106 South East                     Fast Lane50300YSlow Lane56510Y

But, as you can notice, the lanes column isn't cleaned as you would like since it is a grandchild tag inside the XML.

Although, you could create a new data frame from df and df2 as you would like.

  •  Tags:  
  • r xml
  • Related