rowSums() to count number of both non-missing and unique values-CodePudding

Let's say I have this dataframe

> df
   mr_daterd mr_daterd_fu1 mr_daterd_fu2
1 2018-03-05    2018-03-05          <NA>
2 2019-05-04          <NA>    2020-03-05
3 2020-01-03    2020-06-06    2021-04-02

Each row represent a patient and the dates represent MRI scans. I want to count the number of MRI scans per row, i.e. rowSums() of non-missing values. However, some patient had several scans at the same date. Therefore, the rowSums() should only count unique non-missing values.

Eg, using

df_new <- df %>%
  mutate(
n_mri = rowSums(!is.na(select(., contains('mr_daterd'))))
)

Gives

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     2
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

The n_mri for row 1 should be 1, and not 2, because 2018-03-05 is duplicated in mr_daterd and mr_daterd_fu1.

Expected output:

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     1
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

Data

df <- structure(list(mr_daterd = structure(c(17595, 18020, 18264), class = "Date"), 
    mr_daterd_fu1 = structure(c(17595, NA, 18419), class = "Date"), 
    mr_daterd_fu2 = structure(c(NA, 18326, 18719), class = "Date")), class = "data.frame", row.names = c(NA, 
-3L))

CodePudding user response：

dplyr solution using n_distinct and c_across.

df %>% 
  rowwise %>% 
  mutate(n_mri = n_distinct(
    c_across(contains('mr_daterd')), 
    na.rm=TRUE)) %>%
  ungroup()


# A tibble: 3 × 4
# Rowwise: 
  mr_daterd  mr_daterd_fu1 mr_daterd_fu2 n_mri
  <date>     <date>        <date>        <int>
1 2018-03-05 2018-03-05    NA                1
2 2019-05-04 NA            2020-03-05        2
3 2020-01-03 2020-06-06    2021-04-02        3

CodePudding user response：

With base R, you could use apply():

apply(df, 1, \(x) sum(!is.na(unique(x))))

# [1] 1 2 3

CodePudding user response：

Another possible solution, based on purrr::pmap:

library(tidyverse)

df %>% 
  mutate(n_mri = pmap_int(., ~ n_distinct(c(...), na.rm = T)))

#>    mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
#> 1 2018-03-05    2018-03-05          <NA>     1
#> 2 2019-05-04          <NA>    2020-03-05     2
#> 3 2020-01-03    2020-06-06    2021-04-02     3