Home > database >  Using Tidyverse tools (dplyr, purrr, etc.), how to replace ugly coded values for the human-readable
Using Tidyverse tools (dplyr, purrr, etc.), how to replace ugly coded values for the human-readable

Time:11-29

I have two files: a .csv file that has ugly numbered files and a dictionary. Here's a similar and simplified example of the .csv file:

responses <- tribble(
  ~n_case, ~movie, ~song, ~sex,
  1, 2, 2, 0,
  2, 1, 1, 0,
  3, 2, 4, 1
  
)

And here's a similar and simplified example of the dictionary database:

# Note the jump from 2 to 4 in 'song' is deliberate,
# as this appears in the actual database that I am working with
dic <- tribble(
  ~name, ~value, ~tag,
  "movie", 1, "Shrek",
  "movie", 2, "Lego",
  "song", 1, "Hallelujah",
  "song", 2, "Happy Birthday",
  "song", 4, "Melancholy Hill",
  "sex", 0, "male",
  "sex", 1, "female"
)

What could I do to replace the ugly "values" by the human-readable "tags"? I'd like to arrive at this:

> responses_human
# A tibble: 3 × 4
  n_case movie song            sex   
   <dbl> <chr> <chr>           <chr> 
1      1 Lego  Happy Birthday  male  
2      2 Shrek Hallelujah      male  
3      3 Lego  Melancholy Hill female

In case you're curious, I am working with Guatemala's person census (as opposed to house or home census).

I tried solving my problem by filtering each 'name' (survey question's answers; i.e. "movie", "song", "sex"), grabbing the result and creating a new tibble out of that. That way, I'd end up with dozens of tibbles, each being a mini-dictionary for each survey question's answers (movie, song, sex, etc.). Then, I'd left_join these filtered mini-dictionaries, one by one, resulting in dozens of new human-readable columns. I'd then appropriately rename and select the columns so that I'd get rid of the ugly numbers and only keep the human-readable "tags".

However, this solution requires naming dozens of variables and is therefore a slow process. Note that there are 84 variables, each with multiple numbered/ugly "values" and human-readable "tags". Moreover, I will probably use the other (house and home) censuses, which pose the same problem.

I have a hunch that there's a much more efficient and automated way of doing this. I just don't know what it is yet.

CodePudding user response:

One option would be to first split your data dictionary into a list containing a recoding vector per name. In a second step you could use this list inside across to recode the columns of your dataset:

library(dplyr, w = FALSE)
library(tibble)

dic_split <- split(dic, dic$name) %>%
  lapply(function(x) {
    x %>%
      select(-name) %>%
      tibble::deframe()
  })

rec_helper <- function(x, col) {
  recode(x, !!!dic_split[[col]])
}

responses %>%
  mutate(across(
    all_of(intersect(names(dic_split), names(responses))),
    ~ rec_helper(.x, cur_column())
  ))
#> # A tibble: 3 × 4
#>   n_case movie song            sex   
#>    <dbl> <chr> <chr>           <chr> 
#> 1      1 Lego  Happy Birthday  male  
#> 2      2 Shrek Hallelujah      male  
#> 3      3 Lego  Melancholy Hill female

CodePudding user response:

You could go long, join, wide:

library(tidyverse)

responses |>
  pivot_longer(-n_case) |>
  left_join(dic, by = c("name", "value")) |>
  select(-value) |>
  pivot_wider(names_from = name, values_from = tag)
#> # A tibble: 3 x 4
#>   n_case movie song            sex   
#>    <dbl> <chr> <chr>           <chr> 
#> 1      1 Lego  Happy Birthday  male  
#> 2      2 Shrek Hallelujah      male  
#> 3      3 Lego  Melancholy Hill female

or with factor:

responses |>
  mutate(across(movie:sex, \(x) as.character(factor(x,
                                                    levels = dic[dic$name == cur_column(),]$value,
                                                    labels = dic[dic$name == cur_column(),]$tag))))
#> # A tibble: 3 x 4
#>   n_case movie song            sex   
#>    <dbl> <chr> <chr>           <chr> 
#> 1      1 Lego  Happy Birthday  male  
#> 2      2 Shrek Hallelujah      male  
#> 3      3 Lego  Melancholy Hill female

CodePudding user response:

You can reshape the dictionary to wide format and then use it to lookup values across your response data:

library(dplyr)
library(tidyr)

dic_wide <- dic %>%
  pivot_wider(names_from = name, values_from = tag)

responses %>%
  mutate(across(-n_case, ~ dic_wide[[cur_column()]][match(.x, dic_wide$value)]))

# A tibble: 3 × 4
  n_case movie song            sex   
   <dbl> <chr> <chr>           <chr> 
1      1 Lego  Happy Birthday  male  
2      2 Shrek Hallelujah      male  
3      3 Lego  Melancholy Hill female
  • Related