I know how to import a simple csv
file using R
. But, is it possible to import a file to R
including variable and value labels (similar to SPSS sav
files).
Or instead, shall I have two csv
files? One for data and the other for metadata (variable and value labels)?
Something similar to this (resulting from two csv
files). But I think I have a problem with the syntax of the tuples for val_lab
:
> data
# A tibble: 6 × 2
se ctr
<chr> <chr>
1 1 1
2 1 2
3 2 3
4 2 2
5 1 1
6 2 3
> metadata
# A tibble: 2 × 3
var var_label val_lab
<chr> <chr> <chr>
1 se sex (1,'Female'),(2,'Male')
2 ctr country (1,'UK'),(2,'USA'),(3,'France')
Using dput
:
> dput(head(data))
structure(list(se = c("1", "1", "2", "2", "1", "2"), ctr = c("1",
"2", "3", "2", "1", "3")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
> dput(metadata)
structure(list(var = c("se", "ctr"), var_label = c("sex", "country"
), val_lab = c("(1,'Female'),(2,'Male')", "(1,'UK'),(2,'USA'),(3,'France')"
)), row.names = c(NA, -2L), spec = structure(list(cols = list(
var = structure(list(), class = c("collector_character",
"collector")), var_label = structure(list(), class = c("collector_character",
"collector")), val_lab = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ";"), class = "col_spec"), problems = <pointer: 0x00000149af86d620>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
CodePudding user response:
You can do it like this in this case:
for(each_var in metadata$var) {
each_label <- metadata$val_lab[metadata$var==each_var]
# Get data out of weird tuple format
values_list <- strsplit(
gsub("\\(|\\)|'", "", strsplit(each_label, "\\),\\(")[[1]]),
","
)
#
values_df <- do.call(rbind, values_list) |> data.frame() |> setNames(c("values", "labels"))
data[[each_var]] <- factor(data[[each_var]], levels = values_df$values, labels = values_df$labels)
# Set variable label - this will show up in Rstudio viewer
for(each_var in metadata$var) {
attr(data[[each_var]], "labels") <- metadata$var_label[metadata$var==each_var]
}
}
data
# A tibble: 6 x 2
# se ctr
# <fct> <fct>
# 1 Female UK
# 2 Female USA
# 3 Male France
# 4 Male USA
# 5 Female UK
# 6 Male France
data
is now a table of factors, which as Gregor Thomas says is how R deals with this type of data.
Note that most of this code is actually getting the labels and levels out of the tuple converted to a string format. The actual setting of the levels is data[[each_var]] <- factor(data[[each_var]], levels = values_df$values, labels = values_df$labels)
, so if you can write the levels directly to a data frame rather than a tuple then it should be much more straightforward.