I have two files: a .csv file that has ugly numbered files and a dictionary. Here's a similar and simplified example of the .csv file:
responses <- tribble(
~n_case, ~movie, ~song, ~sex,
1, 2, 2, 0,
2, 1, 1, 0,
3, 2, 4, 1
)
And here's a similar and simplified example of the dictionary database:
# Note the jump from 2 to 4 in 'song' is deliberate,
# as this appears in the actual database that I am working with
dic <- tribble(
~name, ~value, ~tag,
"movie", 1, "Shrek",
"movie", 2, "Lego",
"song", 1, "Hallelujah",
"song", 2, "Happy Birthday",
"song", 4, "Melancholy Hill",
"sex", 0, "male",
"sex", 1, "female"
)
What could I do to replace the ugly "values" by the human-readable "tags"? I'd like to arrive at this:
> responses_human
# A tibble: 3 × 4
n_case movie song sex
<dbl> <chr> <chr> <chr>
1 1 Lego Happy Birthday male
2 2 Shrek Hallelujah male
3 3 Lego Melancholy Hill female
In case you're curious, I am working with Guatemala's person census (as opposed to house or home census).
I tried solving my problem by filtering each 'name' (survey question's answers; i.e. "movie", "song", "sex"), grabbing the result and creating a new tibble out of that. That way, I'd end up with dozens of tibbles, each being a mini-dictionary for each survey question's answers (movie, song, sex, etc.). Then, I'd left_join
these filtered mini-dictionaries, one by one, resulting in dozens of new human-readable columns. I'd then appropriately rename
and select
the columns so that I'd get rid of the ugly numbers and only keep the human-readable "tags".
However, this solution requires naming dozens of variables and is therefore a slow process. Note that there are 84 variables, each with multiple numbered/ugly "values" and human-readable "tags". Moreover, I will probably use the other (house and home) censuses, which pose the same problem.
I have a hunch that there's a much more efficient and automated way of doing this. I just don't know what it is yet.
CodePudding user response:
One option would be to first split your data dictionary into a list
containing a recoding vector per name
. In a second step you could use this list
inside across
to recode the columns of your dataset:
library(dplyr, w = FALSE)
library(tibble)
dic_split <- split(dic, dic$name) %>%
lapply(function(x) {
x %>%
select(-name) %>%
tibble::deframe()
})
rec_helper <- function(x, col) {
recode(x, !!!dic_split[[col]])
}
responses %>%
mutate(across(
all_of(intersect(names(dic_split), names(responses))),
~ rec_helper(.x, cur_column())
))
#> # A tibble: 3 × 4
#> n_case movie song sex
#> <dbl> <chr> <chr> <chr>
#> 1 1 Lego Happy Birthday male
#> 2 2 Shrek Hallelujah male
#> 3 3 Lego Melancholy Hill female
CodePudding user response:
You could go long, join, wide:
library(tidyverse)
responses |>
pivot_longer(-n_case) |>
left_join(dic, by = c("name", "value")) |>
select(-value) |>
pivot_wider(names_from = name, values_from = tag)
#> # A tibble: 3 x 4
#> n_case movie song sex
#> <dbl> <chr> <chr> <chr>
#> 1 1 Lego Happy Birthday male
#> 2 2 Shrek Hallelujah male
#> 3 3 Lego Melancholy Hill female
or with factor:
responses |>
mutate(across(movie:sex, \(x) as.character(factor(x,
levels = dic[dic$name == cur_column(),]$value,
labels = dic[dic$name == cur_column(),]$tag))))
#> # A tibble: 3 x 4
#> n_case movie song sex
#> <dbl> <chr> <chr> <chr>
#> 1 1 Lego Happy Birthday male
#> 2 2 Shrek Hallelujah male
#> 3 3 Lego Melancholy Hill female
CodePudding user response:
You can reshape the dictionary to wide format and then use it to lookup values across your response data:
library(dplyr)
library(tidyr)
dic_wide <- dic %>%
pivot_wider(names_from = name, values_from = tag)
responses %>%
mutate(across(-n_case, ~ dic_wide[[cur_column()]][match(.x, dic_wide$value)]))
# A tibble: 3 × 4
n_case movie song sex
<dbl> <chr> <chr> <chr>
1 1 Lego Happy Birthday male
2 2 Shrek Hallelujah male
3 3 Lego Melancholy Hill female