I have a dataset that stores instances for participants vertically over time. They can have basically any number of follow-ups, there are participants with anywhere from 1 to 14 lines, but more are expected to be added with time.
I have a list of variables var
that the participants have presumably reported in each follow-up and want to create a new set of "ever" variables vare
that describe if at any time before this follow-up, a participant reported "yes" for the corresponding variable.
Here is an example of the desired input/output:
var = c("var1","var2")
vare = paste0(var,"_ever")
data = data.frame(idno = c(123,123,123,123,123,123,123)
followup_num = c(0,1,2,3,4,5,6)
var1 = c(0,NA,0,1,0,NA,1)
var2 = c(1,NA,NA,0,0,0,1)
)
data$var1_ever = c(0,0,0,1,1,1,1)
data$var2_ever = c(1,1,1,1,1,1,1)
idno | followup_num | var1 | var1_ever | var2 | var2_ever |
---|---|---|---|---|---|
123 | 0 | 0 | 0 | 1 | 1 |
123 | 1 | NA | 0 | NA | 1 |
123 | 2 | 0 | 0 | NA | 1 |
123 | 3 | 1 | 1 | 0 | 1 |
123 | 4 | 0 | 1 | 0 | 1 |
123 | 5 | NA | 1 | 0 | 1 |
123 | 6 | 1 | 1 | 1 | 1 |
This is the code I am currently using. Obviously, nested for loops are not ideal in R and this segment of code is particularly slow when handed a few thousand lines.
#For each ID
for (i in unique(data$idno)) {
id = data$idno%in%i #Get the relevant lines for this ID
fus = sort(data$followup_num[id]) #Get the follow-up numbers
#For each variable in the list
for (v in seq_along(var)) {
#Loop through the follow-ups. If you see that the variable reports "yes", mark
# this and every proceeding follow-up as having reported that variable ever
# Otherwise, mark the opposite at that line and move to the next follow-up
for (f in fus) {
if (t(data[id & data$followup_num%in%f,var[v]])%in%1) {
data[id & data$followup_num >= f,vare[v]] = 1
break
} else {
data[id & data$followup_num%in%f,vare[v]] = 0
}
}
}
}
Is this a problem with an existing solution? Is there a way to optimize/simplify? Is there a use of apply/sapply/etc. functions that I neglected to try?
CodePudding user response:
At its core, the solution is the base function cummax()
. We need to take into account NA
, so I added replace_na()
. And we need to account for additional idno's by using group_by()
A minimal vectorized solution is
df$var1_test<-cummax(x=replace_na(df$var1, 0))
This is a great problem to solve with the tidyverse
mutate
across
function set!
df = data.frame(idno = c(123,123,123,123,123,123,123),
followup_num = c(0,1,2,3,4,5,6),
var1 = c(0,NA,0,1,0,NA,1),
var2 = c(1,NA,NA,0,0,0,1))
df %>% group_by(idno) %>%
arrange(idno, followup_num) %>%
mutate(across(.cols=starts_with("var"),
.fns= ~ cummax(tidyr::replace_na(.x, 0)),
.names="{.col}_ever2"))
idno followup_num var1 var2 var1_ever2 var2_ever2 1 123 0 0 1 0 1 2 123 1 NA NA 0 1 3 123 2 0 NA 0 1 4 123 3 1 0 1 1 5 123 4 0 0 1 1 6 123 5 NA 0 1 1 7 123 6 1 1 1 1
Alternativly, if you want to summarize the data to a single row, then a grouped max works
df %>%
group_by(idno) %>%
summarise(across(.cols=starts_with("var"),
.fns= ~ max(.x, na.rm=T),
.names="{.col}_ever3"))
idno var1_ever3 var2_ever3 1 123 1 1
ps. data
is an internal function, better to call variable df
.
CodePudding user response:
Consider ave
cummax
(with ifelse
to handle NAs):
data <- within(
data, {
var2_ever <- ave(var2, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
var1_ever <- ave(var1, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
}
)
For many columns:
vars <- names(data)[grep("var", names(data))]
data[paste0(vars, "_ever")] <- sapply(
vars, \(var) ave(data[[var]], data$idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
)