Compile similar variables of multiple editions of the same dataset into one dataset in R-CodePudding

I am working with multiple editions of the same survey dataset. It is an individual level survey asked (almost) annually, however the questions/coverage is subject to change. Additionally, while similar questions are asked in most waves, the question IDs have changed. I need to compile the datasets into one cumulative dataset. To do this, I would like to extract only the questions of interest, and recode them so that all of the variables end up the same.

I have created a key for myself with the range of coverage of each question for each year. A section of it is reflected in the following:

Edition P           S        A       E     F
1995    p33         s1       s2     s17   s20
1996    p40         s1       s2     s14   s16a
1997    sp58        s1       s2     s10   s12a
1998    sp53        s1       s2     s11   s14a
2000    P54ST       S1       S2      S6  reeduc1
2001    p55st       s1       s2      s6  reeduc1
2002    p45st       s1       s2      s6  reeduc1
2003    p54st       s1       s2      s6    s18
2004    p30st       s1       s2      s6   reeduc1
2005    p48st       s6       s7     s11   reeduc1
2006    p38st       s6       s7     s11   reeduc1
2007    p64st       s10      s11    s15   reeduc1
2008    p61st       s8       s9     s15   reeduc1
2009    P35ST       s5       s6     s12   reeduc1
2010    P29ST       S7       S8     S14   REEDUC1
2011    P38ST       S16      S17    S21   REEDUC1
2013    P22TGBSM    S10      S11    S17   REEDUC_1
2015    P23TGBSM    S12      S13    S19   REEDUC_1
2016    P15STGBS    SEXO    EDAD    S13   REEDUC_1
2017    P16STGBS    SEXO    EDAD    S14   REEDUC.1
2018    P21STGBS.A  SEXO    EDAD    S10 
2020    P50STGBS.A  SEXO    EDAD    S16   REEDUC.1

The "Edition" Column is the edition (wave) of the survey. The other columns are all questions asked throughout the dataset, whose names I have assigned letters which could become the new standardized column names for each question. Below those column names are all of the individual variable names within each corresponding survey wave. Note that some questions are not asked in a particular survey wave, and they are thus left blank in the key. For example, the gap in the above key for Variable F in 2018 is a question that was missing. I still need this dataset included within the cumulative dataset, so I would need to input NA values for that question and survey wave (and any other missing questions in a particular wave).

Suppose each file is a .csv file named "surveyXXXX" (with the year of the edition). What would be the simplest, cleanest way to combine the desired questions from each survey wave into one cumulative dataset, preserving all the responses as individual responses? Is there a way to streamline the process such that it can be seemlessly coded for all desired questions that I include in my key?

Example data from a segment of the survey 2020.csv data

structure(list(numinves = c(2020L, 2020L, 2020L, 2020L, 2020L
), idenpa = c(32L, 32L, 32L, 32L, 32L), numentre = 3200001:3200005, 
    reg = c(32002L, 32001L, 32002L, 32002L, 32211L), ciudad = c(32301917L, 
    32001001L, 32301955L, 32301932L, 32211004L), tamciud = c(-4L, 
    -4L, -4L, -4L, -4L), comdist = c(2L, 1L, 280L, 230L, 926L
    ), edad = c(63L, 24L, 20L, 54L, 38L), sexo = c(2L, 1L, 1L, 
    2L, 1L), codigo = c(-4L, -4L, -4L, -4L, -4L)), row.names = c("1", 
"2", "3", "4", "5"), class = "data.frame")

As you can see, the variable values are all numerical. This is true for all survey waves, as the numerical values correspond to external codebooks. Obviously, not all of the desired variables are at the start of the dataframe, nor do they appear in the same order of each wave.

CodePudding user response：

I just saw zephryl's answer but I had already done it so here's a non-tidyverse-based solution:


    data_stack <- data.frame(
      Edition = c(1995, 1996, 1997, 1998, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013, 2015, 2016, 2017, 2018, 2020),
      P = c("p33", "p40", "sp58", "sp53", "P54ST", "p55st", "p45st", "p54st", "p30st", "p48st", "p38st", "p64st", "p61st", "P35ST", "P29ST", "P38ST", "P22TGBSM", "P23TGBSM", "P15STGBS", "P16STGBS", "P21STGBS.A", "P50STGBS.A"),
      S = c("s1", "s1", "s1", "s1", "S1", "s1", "s1", "s1", "s1", "s6", "s6", "s10", "s8", "s5", "S7", "S16", "S10", "S12", "SEXO", "SEXO", "SEXO", "SEXO"),
      A = c("s2", "s2", "s2", "s2", "S2", "s2", "s2", "s2", "s2", "s7", "s7", "s11", "s9", "s6", "S8", "S17", "S11", "S13", "EDAD", "EDAD", "EDAD", "EDAD"),
      E = c("s17", "s14", "s10", "s11", "S6", "s6", "s6", "s6", "s6", "s11", "s11", "s15", "s15", "s12", "S14", "S21", "S17", "S19", "S13", "S14", "S10", "S16"),
      `F` = c("s20", "s16a", "s12a", "s14a", "reeduc1", "reeduc1", "reeduc1", "s18", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "REEDUC1", "REEDUC1", "REEDUC_1", "REEDUC_1", "REEDUC_1", "REEDUC.1", NA_character_, "REEDUC.")
      )
    
    edition_2015 <- data.frame(P23TGBSM = c(3, 4, 5),
                           S12 = c(6, 7, 8))
    
    edition_2016 <- data.frame(P15STGBS = c(10, 11, 12),
                           SEXO = c(13, 14, 15))
     
    edition_2015
    edition_2016
    
    for (year in 2015:2016) {
      
      edition <- get(paste0("edition_", year))
      
      for (i in 1:ncol(edition)) {
        colnames(edition)[i] <- colnames(data_stack[data_stack$Edition == year,])[(which(as.matrix(data_stack[data_stack$Edition == year,]) == colnames(edition)[i], arr.ind = TRUE))[2]]
      }
      assign(paste0("edition_", year), edition)
    }
    
    edition_2015
    edition_2016

CodePudding user response：

Here's a tidyverse-based solution:

library(tidyr)
library(purrr)
library(readr)

# Convert key into a list of named character vectors; one per year, with
# original names as values and new names as names
keys_by_year <- key %>% 
  pivot_longer(!Edition) %>% 
  drop_na(value) %>% 
  split(.$Edition) %>% 
  map(\(df) purrr::set_names(df$value, df$name))

# import surveys, naming each with its year
surveys_raw <- map(
  purrr::set_names(c("1995", "2018")),
  \(year) read_csv(paste0("survey", year, ".csv"))
)

# for each survey, select variables wanted by indexing with the key for that
# year, then rename using names from key
surveys <- imap_dfr(
  surveys_raw, 
  \(survey, year) {
    year_key <- keys_by_year[[year]]
    survey <- survey[year_key]
    colnames(survey) <- names(year_key)
    survey
  },
  .id = "Edition"
  )

# result
surveys

# A tibble: 6 × 6
  Edition     P     S     A     E     F
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995        3     4     2     4     4
2 1995        5     3     5     1     5
3 1995        2     1     4     3     5
4 2018        2     1     1     1    NA
5 2018        4     1     4     1    NA
6 2018        5     1     3     3    NA

Example data:

set.seed(13)

# including a few "unwanted" columns in each
survey1995 <- data.frame(
  p33 = sample(1:5, 3, replace = TRUE),
  p35 = sample(1:5, 3, replace = TRUE),
  s1 = sample(1:5, 3, replace = TRUE),
  s2 = sample(1:5, 3, replace = TRUE),
  s3 = sample(1:5, 3, replace = TRUE),
  s17 = sample(1:5, 3, replace = TRUE),
  s20 = sample(1:5, 3, replace = TRUE)
)

survey2018 <- data.frame(
  P21STGBS.A = sample(1:5, 3, replace = TRUE),
  SEXO = sample(1:5, 3, replace = TRUE),
  EDAD = sample(1:5, 3, replace = TRUE),
  IDIOMA = sample(1:5, 3, replace = TRUE),
  S10 = sample(1:5, 3, replace = TRUE)
)

write_csv(survey1995, "survey1995.csv")
write_csv(survey2018, "survey2018.csv")