Home > Software design >  Is it possible to import data and metadata from a single csv file to R
Is it possible to import data and metadata from a single csv file to R

Time:06-29

I know how to import a simple csv file using R. But, is it possible to import a file to R including variable and value labels (similar to SPSS sav files).

Or instead, shall I have two csv files? One for data and the other for metadata (variable and value labels)?

Something similar to this (resulting from two csv files). But I think I have a problem with the syntax of the tuples for val_lab:

> data
# A tibble: 6 × 2
  se    ctr  
  <chr> <chr>
1 1     1    
2 1     2    
3 2     3    
4 2     2    
5 1     1    
6 2     3    
> metadata
# A tibble: 2 × 3
  var   var_label val_lab                        
  <chr> <chr>     <chr>                          
1 se    sex       (1,'Female'),(2,'Male')        
2 ctr   country   (1,'UK'),(2,'USA'),(3,'France')

Using dput:

> dput(head(data))
structure(list(se = c("1", "1", "2", "2", "1", "2"), ctr = c("1", 
"2", "3", "2", "1", "3")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))
> dput(metadata)
structure(list(var = c("se", "ctr"), var_label = c("sex", "country"
), val_lab = c("(1,'Female'),(2,'Male')", "(1,'UK'),(2,'USA'),(3,'France')"
)), row.names = c(NA, -2L), spec = structure(list(cols = list(
    var = structure(list(), class = c("collector_character", 
    "collector")), var_label = structure(list(), class = c("collector_character", 
    "collector")), val_lab = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ";"), class = "col_spec"), problems = <pointer: 0x00000149af86d620>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

CodePudding user response:

You can do it like this in this case:

for(each_var in metadata$var) {

    each_label  <- metadata$val_lab[metadata$var==each_var]

    # Get data out of weird tuple format
    values_list  <- strsplit(
        gsub("\\(|\\)|'", "", strsplit(each_label, "\\),\\(")[[1]]), 
        ","
    ) 
    # 
    values_df  <- do.call(rbind, values_list)  |> data.frame()  |>  setNames(c("values", "labels"))


    data[[each_var]]  <- factor(data[[each_var]], levels = values_df$values, labels = values_df$labels)

    # Set variable label - this will show up in Rstudio viewer 
    for(each_var in metadata$var) {
        attr(data[[each_var]], "labels")  <- metadata$var_label[metadata$var==each_var]
    }

}

data
# A tibble: 6 x 2
#   se     ctr     
#   <fct>  <fct>   
# 1 Female UK      
# 2 Female USA     
# 3 Male   France  
# 4 Male   USA     
# 5 Female UK
# 6 Male   France

data is now a table of factors, which as Gregor Thomas says is how R deals with this type of data.

Note that most of this code is actually getting the labels and levels out of the tuple converted to a string format. The actual setting of the levels is data[[each_var]] <- factor(data[[each_var]], levels = values_df$values, labels = values_df$labels), so if you can write the levels directly to a data frame rather than a tuple then it should be much more straightforward.

  • Related