Determine if any of the values in a set of variables match the value in another variable-CodePudding

For each person in my dataset (1 row per person), I am trying search a set of variables (months, so in my example Jan - Jul) to see if any of them matches the value in a different variable (follow-up month). I want to create a new variable that says yes or no there is a matching value in the set of variables with the 1 variable.

Basically I am trying to create a timeline for a follow-up visit. I have 'Have' and 'Want' data sets below.

Thank you!

HAVE:

ID	Jan	Feb	Mar	Apr	May	June	Jul	Follow-up month
1	NA	2	3	4	NA	NA	NA	4
2	NA	NA	NA	4	NA	NA	NA	6
3	1	NA	3	4	5	NA	NA	5
4	NA	NA	NA	NA	NA	6	7	9

WANT:

ID	Jan	Feb	Mar	Apr	May	June	Jul	Follow-up month	Follow_up_Status
1	NA	2	3	4	NA	NA	NA	4	Yes
2	NA	NA	NA	4	NA	NA	NA	6	No
3	1	NA	3	4	5	NA	NA	5	Yes
4	NA	NA	NA	NA	NA	6	7	9	No

CodePudding user response：

Here is a version with pivoting:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(
    -c(ID, Follow.up_month)
  ) %>% 
  group_by(ID) %>% 
  mutate(Follow_up_status = ifelse(Follow.up_month %in% value, "Yes", "No")) %>% 
  pivot_wider(
    names_from = name,
    values_from = value
  )

output:

     ID Follow.up_month Follow_up_status   Jan   Feb   Mar   Apr   May  June   Jul
  <int>           <int> <chr>            <int> <int> <int> <int> <int> <int> <int>
1     1               4 Yes                 NA     2     3     4    NA    NA    NA
2     2               6 No                  NA    NA    NA     4    NA    NA    NA
3     3               5 Yes                  1    NA     3     4     5    NA    NA
4     4               9 No                  NA    NA    NA    NA    NA     6     7

CodePudding user response：

I think rowwise and if_any will work for you:

library(dplyr)
quux %>%
  rowwise() %>%
  mutate(Follow2 = if_any(Jan:Jul, ~ . %in% Follow.up.month)) %>%
  ungroup()
# # A tibble: 4 x 11
#      ID   Jan   Feb   Mar   Apr   May  June   Jul Follow.up.month Follow_up_Status Follow2
#   <int> <int> <int> <int> <int> <int> <int> <int>           <int> <chr>            <lgl>  
# 1     1    NA     2     3     4    NA    NA    NA               4 Yes              TRUE   
# 2     2    NA    NA    NA     4    NA    NA    NA               6 No               FALSE  
# 3     3     1    NA     3     4     5    NA    NA               5 Yes              TRUE   
# 4     4    NA    NA    NA    NA    NA     6     7               9 No               FALSE

(It also works well with ~ Follow.up.month %in% ..)

Data

quux <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA),     May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA,     NA, NA, 7L), Follow.up.month = c(4L, 6L, 5L, 9L), Follow_up_Status = c("Yes",     "No", "Yes", "No")), class = "data.frame", row.names = c(NA, -4L))

CodePudding user response：

Another dplyr solution.

library(dplyr)

dat2 <- dat %>%
  mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
  mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
  transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
  bind_cols(dat, .)
dat2
#   ID Jan Feb Mar Apr May June Jul Follow_up_month Follow_up_status
# 1  1  NA   2   3   4  NA   NA  NA               4              Yes
# 2  2  NA  NA  NA   4  NA   NA  NA               6               No
# 3  3   1  NA   3   4   5   NA  NA               5              Yes
# 4  4  NA  NA  NA  NA  NA    6   7               9               No

Date

dat <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA),     May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA,     NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)), class = "data.frame", row.names = c(NA, -4L))

Performance

When the data frame is small, all the solutions here will work. But when data frame is large, the pivoting approach and the rowwise approach may be slow. Below I tried to show the performance comparison of the three solutions. Although the final outputs are different, with different data type and column order, I will still compare them, assuming that these differences are acceptable.

Here is the setup.

library(microbenchmark)
library(dplyr)
library(tidyr)


pivot_fun <- function(x){
  x2 <- x %>%
    pivot_longer(
      -c(ID, Follow_up_month)
    ) %>% 
    group_by(ID) %>% 
    mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>% 
    pivot_wider(
      names_from = name,
      values_from = value
    )
  return(x2)
}

rowwise_fun <- function(x){
  x2 <- x %>%
    pivot_longer(
      -c(ID, Follow_up_month)
    ) %>% 
    group_by(ID) %>% 
    mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>% 
    pivot_wider(
      names_from = name,
      values_from = value
    )
  return(x2)
}

rowSums_fun <- function(x){
  x2 <- x %>%
    mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
    mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
    transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
    bind_cols(x, .)
  return(x2)
}

Here is the comparison on the original example. The solution provided in this post is the fastest.

set.seed(1)

microbenchmark(pivot_fun(dat), rowwise_fun(dat), rowSums_fun(dat))
# Unit: milliseconds
#             expr       min        lq     mean    median        uq     max neval
#   pivot_fun(dat) 11.037401 11.927201 13.58003 12.659001 13.882151 30.0207   100
# rowwise_fun(dat) 10.907602 11.670701 13.56004 12.295051 13.614201 24.4249   100
# rowSums_fun(dat)  6.590502  7.147702  8.48469  7.714351  8.808602 17.0109   100

And here is a comparison on a larger data frame. The solution provided in this post is about 10 times faster than other answers.

set.seed(12)

n <- 100000

dat_n <- data.frame(
  ID = 1:n,
  Jan = sample(dat$Jan, size = n, replace = TRUE),
  Feb = sample(dat$Feb, size = n, replace = TRUE),
  Mar = sample(dat$Mar, size = n, replace = TRUE),
  Apr = sample(dat$Apr, size = n, replace = TRUE),
  May = sample(dat$May, size = n, replace = TRUE),
  June = sample(dat$June, size = n, replace = TRUE),
  Jul = sample(dat$Jul, size = n, replace = TRUE),
  Follow_up_month = sample(1:12, size = n, replace = TRUE)
)

set.seed(123)

microbenchmark(pivot_fun(dat_n), rowwise_fun(dat_n), rowSums_fun(dat_n))
# Unit: milliseconds
#               expr      min        lq      mean    median        uq       max neval
#   pivot_fun(dat_n) 1168.416 1405.5724 1496.6545 1471.0253 1574.3927 2327.1624   100
# rowwise_fun(dat_n) 1159.790 1401.0586 1494.9987 1465.8929 1580.0092 1982.5099   100
# rowSums_fun(dat_n)   84.494  102.0946  122.2843  111.8158  123.6288  296.3234   100

ID	Jan	Feb	Mar	Apr	May	June	Jul	Follow-up month	Follow_up_Status
1	NA	2	3	4	NA	NA	NA	4	Yes
2	NA	NA	NA	4	NA	NA	NA	6	No
3	1	NA	3	4	5	NA	NA	5	Yes
4	NA	NA	NA	NA	NA	6	7	9	No

ID	Jan	Feb	Mar	Apr	May	June	Jul	Follow-up month	Follow_up_Status
1	NA	2	3	4	NA	NA	NA	4	Yes
2	NA	NA	NA	4	NA	NA	NA	6	No
3	1	NA	3	4	5	NA	NA	5	Yes
4	NA	NA	NA	NA	NA	6	7	9	No

ID	Jan	Feb	Mar	Apr	May	June	Jul	Follow-up month	Follow_up_Status
1	NA	2	3	4	NA	NA	NA	4	Yes
2	NA	NA	NA	4	NA	NA	NA	6	No
3	1	NA	3	4	5	NA	NA	5	Yes
4	NA	NA	NA	NA	NA	6	7	9	No