I have a panel data set with two waves. I want to calculate the (possible) change from wave 1 to wave 2 of a selected variable and add the result as a new variable to the dataset. Two challenges: There are a considerable number of respondents who participated either only in the first wave or only in the second wave. Also, there are many NAs, which together make the use of lag or diff function difficult I think.
Example:
RespondentID <- c("id_001","id_002","id_002","id_003", "id_004", "id_004", "id_005", "id_005")
Wave <- c(1,1,2,2,1,2,1,2)
Value <- c(2,2,3,4,3,1,NA,3)
panel <- cbind(RespondentID,Wave,Value)
panel
RespondentID Wave Value
[1,] "id_001" "1" "2"
[2,] "id_002" "1" "2"
[3,] "id_002" "2" "3"
[4,] "id_003" "2" "4"
[5,] "id_004" "1" "3"
[6,] "id_004" "2" "1"
[7,] "id_005" "1" NA
[8,] "id_005" "2" "3"
The result should look like this:
RespondentID Wave Value Change
[1,] "id_001" "1" "2" NA
[2,] "id_002" "1" "2" NA
[3,] "id_002" "2" "3" "1"
[4,] "id_003" "2" "4" NA
[5,] "id_004" "1" "3" NA
[6,] "id_004" "2" "1" "-2"
[7,] "id_005" "1" NA NA
[8,] "id_005" "2" "3" NA
CodePudding user response:
You could use a data structure that recognises the panel data property of your data and "panel-aware" functions. pdata.frame
in package plm
is one candidate. See the example below, picking up from your example (but putting the various variables via data.frame
together first (your approach with cbind
lumps the columns together but converts to a character)):
panel <- data.frame(RespondentID,Wave,Value)
library(plm)
pdata <- pdata.frame(panel, index = c("RespondentID", "Wave"))
pdata$Change <- diff(pdata$Value)
pdata
#> RespondentID Wave Value Change
#> id_001-1 id_001 1 2 NA
#> id_002-1 id_002 1 2 NA
#> id_002-2 id_002 2 3 1
#> id_003-2 id_003 2 4 NA
#> id_004-1 id_004 1 3 NA
#> id_004-2 id_004 2 1 -2
#> id_005-1 id_005 1 NA NA
#> id_005-2 id_005 2 3 NA
The diff
on a pdata.frame's column is a panel-aware diff.
CodePudding user response:
First I create the example data using the data.frame
function instead of cbind
since I would like to have the 'Value' column to have numeric values.
library(data.table)
RespondentID <- c("id_001","id_002","id_002","id_003", "id_004", "id_004", "id_005", "id_005")
Wave <- c(1,1,2,2,1,2,1,2)
Value <- c(2,2,3,4,3,1,NA,3)
panel <- data.frame(RespondentID,Wave,Value)
panel
I create a new panel by reshaping the panel data into a wide format.
new_panel <- as.data.table(panel)
new_panel <- dcast(new_panel,
RespondentID~Wave,
fill = NA_integer_,
value.var = c("Value"))
I change the column names to something meaningful
colnames(new_panel)[2:3] <- c("Value_1", "Value_2")
The transformed data looks like this.
new_panel
RespondentID Value_1 Value_2
1: id_001 2 NA
2: id_002 2 3
3: id_003 NA 4
4: id_004 3 1
5: id_005 NA 3
Now I create the Change column and the result looks like below.
new_panel[,Change:=Value_2 - Value_1]
new_panel
RespondentID Value_1 Value_2 Change
1: id_001 2 NA NA
2: id_002 2 3 1
3: id_003 NA 4 NA
4: id_004 3 1 -2
5: id_005 NA 3 NA
I hope this helps.