I have many columns in one data frame that record the same variable at different timepoints. In my data, if there is no response at timepoint X-1, there is no response at timepoint X or at any later one (once an NA appears in a row, every later column in that row is NA too). I already have a column that holds the last response in each row and another that shows which column that response came from. The data frame currently looks like this:
  id X1 X2 X3 X4 X_final X_final_location
1  1  5  5  6 NA       6               X3
2  2  4 NA NA NA       4               X1
3  3  7  1  3  5       5               X4
4  4  8  2  4  2       2               X4
5  5  1  5 NA NA       5               X2
6  6  5  7  7  7       7               X4
My goal is to run a regression using the last response of each row as the outcome variable. However, I don't want that value to appear twice, once in the "X_final" column and again in the column it actually came from. So I am hoping to find a way to put a "." in the cell the value originally came from, so that it looks like this:
  id X1   X2 X3 X4 X_final X_final_location
1  1  5    5  6 NA       6               X3
2  2  . <NA> NA NA       4               X1
3  3  7    1  3  5       5               X4
4  4  8    2  4  2       2               X4
5  5  1    . NA NA       5               X2
6  6  5    7  7  7       7               X4
Any suggestions would be appreciated - thank you!
CodePudding user response:
One way to do this (using NA instead of . to preserve the data type):
match finds the position of the first NA in a row, and replace sets the value at that position - 1 (the previous one) to NA. apply(data, 1, \(x) ...) applies that function to each row. Finally, t transposes the result (since apply returns the per-row results as columns).
data <- data.frame(
  id = 1:6,
  X1 = c(5L, 4L, 7L, 8L, 1L, 5L),
  X2 = c(5L, NA, 1L, 2L, 5L, 7L),
  X3 = c(6L, NA, 3L, 4L, NA, 7L),
  X4 = c(NA, NA, 5L, 2L, NA, 7L),
  X_final = c(6L, 4L, 5L, 2L, 5L, 7L),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
# For each row of X1..X4, set the value just before the first NA to NA.
data[, 2:5] <- t(apply(data[, 2:5], 1, function(x) replace(x, match(NA, x) - 1, NA)))
data
#>   id X1 X2 X3 X4 X_final X_final_location
#> 1  1  5  5 NA NA       6               X3
#> 2  2 NA NA NA NA       4               X1
#> 3  3  7  1  3  5       5               X4
#> 4  4  8  2  4  2       2               X4
#> 5  5  1 NA NA NA       5               X2
#> 6  6  5  7  7  7       7               X4
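To see what match and replace do inside the apply call above, here is a minimal sketch on single rows. The vectors x and y are made up for illustration and are not taken from the question's data. It also shows why rows without any NA (rows 3, 4 and 6 in the output above) come back unchanged: match returns NA, so replace has nothing to overwrite.

```r
# One illustrative row: answered at X1 and X2, then no more responses.
x <- c(X1 = 1L, X2 = 5L, X3 = NA, X4 = NA)

first_na  <- match(NA, x)    # position of the first NA in the row
last_resp <- first_na - 1    # position of the last actual response

# Copy of x with the last response set to NA.
x_blanked <- replace(x, last_resp, NA)

# A complete row: match() finds no NA and returns NA,
# so the index is NA and no element is replaced.
y <- c(X1 = 7L, X2 = 1L, X3 = 3L, X4 = 5L)
y_blanked <- replace(y, match(NA, y) - 1, NA)
```

If the last response of a complete row should be blanked as well, the X_final_location column is the safer thing to index by, since this approach only looks for the first NA.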
Another way, using split (grouping by row) and the column named in X_final_location; this one writes "." as asked in the question:
split(data, row.names(data)) <-
  lapply(split(data, row.names(data)), \(x) replace(x, x$X_final_location, "."))
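The reason NA is suggested over "." comes down to how R stores a column: a vector has a single type, so putting "." into a numeric vector converts every value to character. A minimal sketch with a made-up vector (x, x_dot and x_na are illustrative names only):

```r
# An illustrative numeric column.
x <- c(5, 4, 7)

# Writing "." forces the whole vector to character.
x_dot <- replace(x, 2, ".")
class(x_dot)

# Writing NA keeps the vector numeric, so it still works in arithmetic.
x_na <- replace(x, 2, NA)
class(x_na)
mean(x_na, na.rm = TRUE)
```

A column that has become character would have to be converted back before it could be used as a numeric variable in a model.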
CodePudding user response:
Another method, since you already have the locations in $X_final_location, is to loop over the rows and set the named column to NA. As mentioned in the question comments, NA values are preferable to "." if the goal is regression analysis, because they keep the columns numeric.
data_orig <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
X1 = c(5, 4, 7, 8, 1, 5),
X2 = c(5, NA, 1, 2, 5, 7),
X3 = c(6, NA, 3, 4, NA, 7),
X4 = c(NA, NA, 5, 2, NA, 7),
X_final = c(6, 4, 5, 2, 5, 7),
X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
data_new <- data_orig
# Set the column named in X_final_location to NA, one row at a time.
for (i in seq_len(nrow(data_new))) {
  data_new[i, data_new$X_final_location[i]] <- NA
}
data_new
#   id X1 X2 X3 X4 X_final X_final_location
# 1  1  5  5 NA NA       6               X3
# 2  2 NA NA NA NA       4               X1
# 3  3  7  1  3 NA       5               X4
# 4  4  8  2  4 NA       2               X4
# 5  5  1 NA NA NA       5               X2
# 6  6  5  7  7 NA       7               X4
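The same result can probably be reached without an explicit loop by indexing with a two-column matrix of (row, column) positions. This is only a sketch of that alternative, not part of the answer above; the names resp_cols, m, idx and data_vec are introduced here for illustration, and it assumes every value in X_final_location is one of X1 to X4.

```r
data_orig <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)

data_vec  <- data_orig
resp_cols <- c("X1", "X2", "X3", "X4")

# Work on the response columns as a numeric matrix.
m <- as.matrix(data_vec[, resp_cols])

# One (row, column) pair per row: the column is looked up by name.
idx <- cbind(seq_len(nrow(m)), match(data_vec$X_final_location, resp_cols))

# Matrix indexing sets exactly those cells to NA.
m[idx] <- NA

# Write the modified columns back into the data frame.
data_vec[, resp_cols] <- m
data_vec
```

For six rows the loop is just as good; the matrix form mainly helps when there are many rows.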