Home > Enterprise >  How to create a dataframe using information in an XML?
How to create a dataframe using information in an XML?

Time:12-01

I have an existing dataframe (df1) that looks like this:

    txnID   date        repID
1   1001    8/5/2020    203
2   1002    6/28/2020   887
3   1003    6/28/2020   887
4   1004    4/16/2020   887
5   1005    6/14/2020   887
6   1006    5/19/2020   332
7   1019    9/23/2020   100
8   1027    12/4/2020   221

I also have an XML that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<salesteam>
  <rep rID="r100">
    <firstName>Helmut</firstName>
    <lastName>Schwab</lastName>
    <territory>EMEA</territory>
  </rep>
  <rep rID="r887">
    <firstName>Walison</firstName>
    <lastName>da Silva</lastName>
    <territory>South America</territory>
  </rep>
  <rep rID="r332">
    <firstName>Lynette</firstName>
    <lastName>McRowe</lastName>
    <territory>East</territory>
  </rep>
  <rep rID="r203">
    <firstName>Aneeta</firstName>
    <lastName>Kappoorthy</lastName>
    <territory>West</territory>
  </rep>
  <rep rID="r221">
    <firstName>Veronika</firstName>
    <lastName>Sixt</lastName>
    <territory>EMEA</territory>
  </rep>
</salesteam>

I want to create a new dataframe that assigns the repID to the appropriate lastName according to the xml and creates a new dataframe (df3) including the data from df1.

head(df3)
    txnID   date        repID   lastName
1   1001    8/5/2020    203     Kappoorthy
2   1002    6/28/2020   887     da Silva
3   1003    6/28/2020   887     da Silva
4   1004    4/16/2020   887     da Silva
5   1005    6/14/2020   887     da Silva
6   1006    5/19/2020   332     McRowe
7   1019    9/23/2020   100     Schwab
8   1027    12/4/2020   221     Sixt

CodePudding user response:

You’ll first have to wrangle your XML into a dataframe, which you can do with xml2. Then, you can just merge in the info from the XML dataframe into df1 using dplyr::left_join().

library(xml2)
library(tidyr)
library(dplyr)

xml_df <- tibble(
    xml = as_list(read_xml(xml))[[1]],
    repID = as.numeric(gsub(
      "^r", 
      "", 
      sapply(xml, attr, "rID")
    ))
  ) %>%
  unnest_wider(xml, transform = ~ .x[[1]])

df1 %>%
  left_join(xml_df) %>%
  select(!c(firstName, territory))
# A tibble: 8 × 4
  txnID date      repID lastName  
  <dbl> <chr>     <dbl> <chr>     
1  1001 8/5/2020    203 Kappoorthy
2  1002 6/28/2020   887 da Silva  
3  1003 6/28/2020   887 da Silva  
4  1004 4/16/2020   887 da Silva  
5  1005 6/14/2020   887 da Silva  
6  1006 5/19/2020   332 McRowe    
7  1019 9/23/2020   100 Schwab    
8  1027 12/4/2020   221 Sixt
  • Related