How do I count the number of distinct values of a column based on another column criterion?-CodePudding

Let's say I have a dataframe that looks like:

owner     unit_id   detail
abc123    002NH034  94847DT
abc123    002NH034  94868DT
abc123    002NH034  94889DT
abc123    112NH035  94899DT
abc123    112NH036
abc123    112NH037

I'm trying to roll up to the level of the first column, while counting the number of distinct values in the second column that have a value in the third column. So expected output would be:

abc123  2

I've tried a mix of sum, is.na, and a few other dplyr functions but am a little stuck. I've tried a variation of the following with random other functions but keep getting sums of all the rows.

df %>%  group_by(owner) %>% summarise(unit_ids_with_detail = sum(!is.na(detail))

CodePudding user response：

This should do:

df %>% filter(!is.na(detail)) %>%  group_by(owner) %>% 
summarise(unit_ids_with_detail = n_distinct(unit_id))

Remove NAs from third column, group by first column, count distinct values on second column. This will drop owners that have only NAs on the third column, though

CodePudding user response：

If we want the distinct elements, use n_distinct - Grouped by 'owner', use n_distinct on the 'unit_id' and specify na.rm = TRUE to remove the NA elements

library(dplyr)
df %>%  
   group_by(owner) %>% 
   summarise(unit_ids_with_detail = n_distinct(detail, na.rm = TRUE))