For each person in my dataset (1 row per person), I am trying search a set of variables (months, so in my example Jan - Jul) to see if any of them matches the value in a different variable (follow-up month). I want to create a new variable that says yes or no there is a matching value in the set of variables with the 1 variable.
Basically I am trying to create a timeline for a follow-up visit. I have 'Have' and 'Want' data sets below.
Thank you!
HAVE:
ID | Jan | Feb | Mar | Apr | May | June | Jul | Follow-up month |
---|---|---|---|---|---|---|---|---|
1 | NA | 2 | 3 | 4 | NA | NA | NA | 4 |
2 | NA | NA | NA | 4 | NA | NA | NA | 6 |
3 | 1 | NA | 3 | 4 | 5 | NA | NA | 5 |
4 | NA | NA | NA | NA | NA | 6 | 7 | 9 |
WANT:
ID | Jan | Feb | Mar | Apr | May | June | Jul | Follow-up month | Follow_up_Status |
---|---|---|---|---|---|---|---|---|---|
1 | NA | 2 | 3 | 4 | NA | NA | NA | 4 | Yes |
2 | NA | NA | NA | 4 | NA | NA | NA | 6 | No |
3 | 1 | NA | 3 | 4 | 5 | NA | NA | 5 | Yes |
4 | NA | NA | NA | NA | NA | 6 | 7 | 9 | No |
CodePudding user response:
Here is a version with pivoting:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(ID, Follow.up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow.up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
output:
ID Follow.up_month Follow_up_status Jan Feb Mar Apr May June Jul
<int> <int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 4 Yes NA 2 3 4 NA NA NA
2 2 6 No NA NA NA 4 NA NA NA
3 3 5 Yes 1 NA 3 4 5 NA NA
4 4 9 No NA NA NA NA NA 6 7
CodePudding user response:
I think rowwise
and if_any
will work for you:
library(dplyr)
quux %>%
rowwise() %>%
mutate(Follow2 = if_any(Jan:Jul, ~ . %in% Follow.up.month)) %>%
ungroup()
# # A tibble: 4 x 11
# ID Jan Feb Mar Apr May June Jul Follow.up.month Follow_up_Status Follow2
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <lgl>
# 1 1 NA 2 3 4 NA NA NA 4 Yes TRUE
# 2 2 NA NA NA 4 NA NA NA 6 No FALSE
# 3 3 1 NA 3 4 5 NA NA 5 Yes TRUE
# 4 4 NA NA NA NA NA 6 7 9 No FALSE
(It also works well with ~ Follow.up.month %in% .
.)
Data
quux <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow.up.month = c(4L, 6L, 5L, 9L), Follow_up_Status = c("Yes", "No", "Yes", "No")), class = "data.frame", row.names = c(NA, -4L))
CodePudding user response:
Another dplyr
solution.
library(dplyr)
dat2 <- dat %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(dat, .)
dat2
# ID Jan Feb Mar Apr May June Jul Follow_up_month Follow_up_status
# 1 1 NA 2 3 4 NA NA NA 4 Yes
# 2 2 NA NA NA 4 NA NA NA 6 No
# 3 3 1 NA 3 4 5 NA NA 5 Yes
# 4 4 NA NA NA NA NA 6 7 9 No
Date
dat <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)), class = "data.frame", row.names = c(NA, -4L))
Performance
When the data frame is small, all the solutions here will work. But when data frame is large, the pivoting approach and the rowwise approach may be slow. Below I tried to show the performance comparison of the three solutions. Although the final outputs are different, with different data type and column order, I will still compare them, assuming that these differences are acceptable.
Here is the setup.
library(microbenchmark)
library(dplyr)
library(tidyr)
pivot_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowwise_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowSums_fun <- function(x){
x2 <- x %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(x, .)
return(x2)
}
Here is the comparison on the original example. The solution provided in this post is the fastest.
set.seed(1)
microbenchmark(pivot_fun(dat), rowwise_fun(dat), rowSums_fun(dat))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat) 11.037401 11.927201 13.58003 12.659001 13.882151 30.0207 100
# rowwise_fun(dat) 10.907602 11.670701 13.56004 12.295051 13.614201 24.4249 100
# rowSums_fun(dat) 6.590502 7.147702 8.48469 7.714351 8.808602 17.0109 100
And here is a comparison on a larger data frame. The solution provided in this post is about 10 times faster than other answers.
set.seed(12)
n <- 100000
dat_n <- data.frame(
ID = 1:n,
Jan = sample(dat$Jan, size = n, replace = TRUE),
Feb = sample(dat$Feb, size = n, replace = TRUE),
Mar = sample(dat$Mar, size = n, replace = TRUE),
Apr = sample(dat$Apr, size = n, replace = TRUE),
May = sample(dat$May, size = n, replace = TRUE),
June = sample(dat$June, size = n, replace = TRUE),
Jul = sample(dat$Jul, size = n, replace = TRUE),
Follow_up_month = sample(1:12, size = n, replace = TRUE)
)
set.seed(123)
microbenchmark(pivot_fun(dat_n), rowwise_fun(dat_n), rowSums_fun(dat_n))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat_n) 1168.416 1405.5724 1496.6545 1471.0253 1574.3927 2327.1624 100
# rowwise_fun(dat_n) 1159.790 1401.0586 1494.9987 1465.8929 1580.0092 1982.5099 100
# rowSums_fun(dat_n) 84.494 102.0946 122.2843 111.8158 123.6288 296.3234 100