Check unique values in multiple columns (treated as one big 'column') in R

How to use dplyr::across to check unique values in multiple columns by group?

The code below still treats each column independently. I would like the number of unique values across the variables DX1:DX4 taken together, per id.

With the sample data, id 1 would have 5 unique values (A, B, C, D, F), and id 2 would also have 5 (A, B, C, D, E).

library(dplyr)
x <- df %>%
  group_by(id) %>%
  summarize(across(DX1:DX4, n_distinct, na.rm = TRUE))

df <- read.table(header = TRUE, text = "
id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B
", stringsAsFactors = FALSE)

CodePudding user response：

After grouping by 'id', use across() to select the columns, flatten them into a single character vector with unlist() or purrr's flatten_chr(), and count the distinct elements with n_distinct().

library(dplyr)
library(purrr)
df %>%
  group_by(id) %>% 
  summarise(n = n_distinct(flatten_chr(across(DX1:DX4)), na.rm = TRUE), 
     .groups = 'drop')

-output

# A tibble: 2 × 2
     id     n
  <int> <int>
1     1     5
2     2     5
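
Another way to get the same count, sketched here as an alternative rather than as part of the answer above, is to reshape the data to long form so that DX1:DX4 literally become one column, and then count distinct values per id. This assumes the tidyr package is available; the names "column" and "value" are arbitrary choices for this example.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- read.table(header = TRUE, text = "
id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B
", stringsAsFactors = FALSE)

res <- df %>%
  # Stack DX1:DX4 into a single 'value' column, one row per cell
  pivot_longer(cols = DX1:DX4, names_to = "column", values_to = "value") %>%
  group_by(id) %>%
  # Count distinct codes per id, ignoring missing values
  summarise(n = n_distinct(value, na.rm = TRUE), .groups = "drop")

print(res)
```

The long form is also convenient if you later want to list the distinct codes per id instead of only counting them.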

CodePudding user response：

I don't think across is the "tidyverse" way to go. I suggest cur_data() instead.

df %>%
  group_by(id) %>%
  summarise(n = n_distinct(unlist(select(cur_data(),DX1:DX4))))
## A tibble: 2 × 2
#     id     n
#  <int> <int>
#1     1     5
#2     2     5
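
Note that cur_data() is deprecated as of dplyr 1.1.0 in favor of pick(). A minimal sketch of the same idea with pick(), assuming dplyr >= 1.1.0 is installed:

```r
library(dplyr)

# Sample data from the question
df <- read.table(header = TRUE, text = "
id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B
", stringsAsFactors = FALSE)

res <- df %>%
  group_by(id) %>%
  summarise(
    # pick() returns the selected columns for the current group as a tibble;
    # unlist() flattens them into one vector before counting
    n = n_distinct(unlist(pick(DX1:DX4), use.names = FALSE), na.rm = TRUE),
    .groups = "drop"
  )

print(res)
```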

CodePudding user response：

Base R:

> df <- read.table(header = TRUE, text = "id DX1 DX2 DX3 DX4\n1 A B A A\n1 A A A C\n1 D A A A\n1 A A A F\n1 A A A A\n2 A A A A\n2 A C A A\n2 A A A D\n2 A E D B")
> sapply(split(df[, -1], df[, 1]), \(x) length(unique(unlist(x))))
1 2
5 5
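
Unlike the dplyr answers that pass na.rm = TRUE, unique() counts NA as a value of its own. If the real data contain missing codes, a variant along these lines may be closer to what is wanted. This is only a sketch: count_distinct is a hypothetical helper name, and df_na is a modified copy of the sample data made up to show the effect.

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "
id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B
", stringsAsFactors = FALSE)

# Hypothetical helper: flatten all columns, drop NA, count distinct values
count_distinct <- function(x) {
  v <- unlist(x, use.names = FALSE)
  length(unique(v[!is.na(v)]))
}

# Split the DX columns by id and apply the helper to each piece
res <- sapply(split(df[, -1], df$id), count_distinct)
print(res)

# Illustration only: blank out the single "B" for id 1 and recount
df_na <- df
df_na$DX2[1] <- NA
res_na <- sapply(split(df_na[, -1], df_na$id), count_distinct)
print(res_na)  # id 1 drops to 4 because the NA is not counted
```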