I am working with multiple editions of the same survey dataset. It is an individual level survey asked (almost) annually, however the questions/coverage is subject to change. Additionally, while similar questions are asked in most waves, the question IDs have changed. I need to compile the datasets into one cumulative dataset. To do this, I would like to extract only the questions of interest, and recode them so that all of the variables end up the same.
I have created a key for myself with the range of coverage of each question for each year. A section of it is reflected in the following:
Edition P S A E F
1995 p33 s1 s2 s17 s20
1996 p40 s1 s2 s14 s16a
1997 sp58 s1 s2 s10 s12a
1998 sp53 s1 s2 s11 s14a
2000 P54ST S1 S2 S6 reeduc1
2001 p55st s1 s2 s6 reeduc1
2002 p45st s1 s2 s6 reeduc1
2003 p54st s1 s2 s6 s18
2004 p30st s1 s2 s6 reeduc1
2005 p48st s6 s7 s11 reeduc1
2006 p38st s6 s7 s11 reeduc1
2007 p64st s10 s11 s15 reeduc1
2008 p61st s8 s9 s15 reeduc1
2009 P35ST s5 s6 s12 reeduc1
2010 P29ST S7 S8 S14 REEDUC1
2011 P38ST S16 S17 S21 REEDUC1
2013 P22TGBSM S10 S11 S17 REEDUC_1
2015 P23TGBSM S12 S13 S19 REEDUC_1
2016 P15STGBS SEXO EDAD S13 REEDUC_1
2017 P16STGBS SEXO EDAD S14 REEDUC.1
2018 P21STGBS.A SEXO EDAD S10
2020 P50STGBS.A SEXO EDAD S16 REEDUC.1
The "Edition" Column is the edition (wave) of the survey. The other columns are all questions asked throughout the dataset, whose names I have assigned letters which could become the new standardized column names for each question. Below those column names are all of the individual variable names within each corresponding survey wave. Note that some questions are not asked in a particular survey wave, and they are thus left blank in the key. For example, the gap in the above key for Variable F in 2018 is a question that was missing. I still need this dataset included within the cumulative dataset, so I would need to input NA values for that question and survey wave (and any other missing questions in a particular wave).
Suppose each file is a .csv file named "surveyXXXX" (with the year of the edition). What would be the simplest, cleanest way to combine the desired questions from each survey wave into one cumulative dataset, preserving all the responses as individual responses? Is there a way to streamline the process such that it can be seemlessly coded for all desired questions that I include in my key?
Example data from a segment of the survey 2020.csv data
structure(list(numinves = c(2020L, 2020L, 2020L, 2020L, 2020L
), idenpa = c(32L, 32L, 32L, 32L, 32L), numentre = 3200001:3200005,
reg = c(32002L, 32001L, 32002L, 32002L, 32211L), ciudad = c(32301917L,
32001001L, 32301955L, 32301932L, 32211004L), tamciud = c(-4L,
-4L, -4L, -4L, -4L), comdist = c(2L, 1L, 280L, 230L, 926L
), edad = c(63L, 24L, 20L, 54L, 38L), sexo = c(2L, 1L, 1L,
2L, 1L), codigo = c(-4L, -4L, -4L, -4L, -4L)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
As you can see, the variable values are all numerical. This is true for all survey waves, as the numerical values correspond to external codebooks. Obviously, not all of the desired variables are at the start of the dataframe, nor do they appear in the same order of each wave.
CodePudding user response:
I just saw zephryl's answer but I had already done it so here's a non-tidyverse-based solution:
data_stack <- data.frame(
Edition = c(1995, 1996, 1997, 1998, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013, 2015, 2016, 2017, 2018, 2020),
P = c("p33", "p40", "sp58", "sp53", "P54ST", "p55st", "p45st", "p54st", "p30st", "p48st", "p38st", "p64st", "p61st", "P35ST", "P29ST", "P38ST", "P22TGBSM", "P23TGBSM", "P15STGBS", "P16STGBS", "P21STGBS.A", "P50STGBS.A"),
S = c("s1", "s1", "s1", "s1", "S1", "s1", "s1", "s1", "s1", "s6", "s6", "s10", "s8", "s5", "S7", "S16", "S10", "S12", "SEXO", "SEXO", "SEXO", "SEXO"),
A = c("s2", "s2", "s2", "s2", "S2", "s2", "s2", "s2", "s2", "s7", "s7", "s11", "s9", "s6", "S8", "S17", "S11", "S13", "EDAD", "EDAD", "EDAD", "EDAD"),
E = c("s17", "s14", "s10", "s11", "S6", "s6", "s6", "s6", "s6", "s11", "s11", "s15", "s15", "s12", "S14", "S21", "S17", "S19", "S13", "S14", "S10", "S16"),
`F` = c("s20", "s16a", "s12a", "s14a", "reeduc1", "reeduc1", "reeduc1", "s18", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "REEDUC1", "REEDUC1", "REEDUC_1", "REEDUC_1", "REEDUC_1", "REEDUC.1", NA_character_, "REEDUC.")
)
edition_2015 <- data.frame(P23TGBSM = c(3, 4, 5),
S12 = c(6, 7, 8))
edition_2016 <- data.frame(P15STGBS = c(10, 11, 12),
SEXO = c(13, 14, 15))
edition_2015
edition_2016
for (year in 2015:2016) {
edition <- get(paste0("edition_", year))
for (i in 1:ncol(edition)) {
colnames(edition)[i] <- colnames(data_stack[data_stack$Edition == year,])[(which(as.matrix(data_stack[data_stack$Edition == year,]) == colnames(edition)[i], arr.ind = TRUE))[2]]
}
assign(paste0("edition_", year), edition)
}
edition_2015
edition_2016
CodePudding user response:
Here's a tidyverse-based solution:
library(tidyr)
library(purrr)
library(readr)
# Convert key into a list of named character vectors; one per year, with
# original names as values and new names as names
keys_by_year <- key %>%
pivot_longer(!Edition) %>%
drop_na(value) %>%
split(.$Edition) %>%
map(\(df) purrr::set_names(df$value, df$name))
# import surveys, naming each with its year
surveys_raw <- map(
purrr::set_names(c("1995", "2018")),
\(year) read_csv(paste0("survey", year, ".csv"))
)
# for each survey, select variables wanted by indexing with the key for that
# year, then rename using names from key
surveys <- imap_dfr(
surveys_raw,
\(survey, year) {
year_key <- keys_by_year[[year]]
survey <- survey[year_key]
colnames(survey) <- names(year_key)
survey
},
.id = "Edition"
)
# result
surveys
# A tibble: 6 × 6
Edition P S A E F
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995 3 4 2 4 4
2 1995 5 3 5 1 5
3 1995 2 1 4 3 5
4 2018 2 1 1 1 NA
5 2018 4 1 4 1 NA
6 2018 5 1 3 3 NA
Example data:
set.seed(13)
# including a few "unwanted" columns in each
survey1995 <- data.frame(
p33 = sample(1:5, 3, replace = TRUE),
p35 = sample(1:5, 3, replace = TRUE),
s1 = sample(1:5, 3, replace = TRUE),
s2 = sample(1:5, 3, replace = TRUE),
s3 = sample(1:5, 3, replace = TRUE),
s17 = sample(1:5, 3, replace = TRUE),
s20 = sample(1:5, 3, replace = TRUE)
)
survey2018 <- data.frame(
P21STGBS.A = sample(1:5, 3, replace = TRUE),
SEXO = sample(1:5, 3, replace = TRUE),
EDAD = sample(1:5, 3, replace = TRUE),
IDIOMA = sample(1:5, 3, replace = TRUE),
S10 = sample(1:5, 3, replace = TRUE)
)
write_csv(survey1995, "survey1995.csv")
write_csv(survey2018, "survey2018.csv")