How to find the number of factors by different Ids using dplyr?-CodePudding

I am trying to converge my data by organization ids. To do so, I need to find how many different domains there are for each id (excluding NAs). I can't figure out how to do it in R using dplyr package. Any help is appreciated.

This is something similar to what my data looks like.

id	domain
1	ca
1	ca
1	com
1	na
1	com
2	org
2	na
2	ca
3	org
3	ca
3	com
3	net

What I want is to get the following output:

id	count_domain
1	2
2	2
3	4

I don't want to do this in excel, but if that's the more efficient approach, I am also open to answers in excel.

PS: My actual data contains much more rows and types of domain, so writing 'if statements' is not something that would be feasible.

CodePudding user response：

I assume that your data has NA and not literal string "na", then you could try

library(tidyverse)

df <- read_table("id    domain
1   ca
1   ca
1   com
1   NA
1   com
2   org
2   NA
2   ca
3   org
3   ca
3   com
3   net") 

df %>%  
  group_by(id) %>%
  summarise(count_domain = sum(!is.na(unique(domain))))

# A tibble: 3 x 2
     id count_domain
  <dbl>        <int>
1     1            2
2     2            2
3     3            4

CodePudding user response：

For completeness, this can be achieved with the following solution in Excel (assuming that your data is stored in the range A:B; and na should not count according to your sample data).

First, enter the following formula in, e.g., cell D2, to create the unique set of IDs (this could also be specified more dynamically, if needed for e.g., varying lengths of the data):

=UNIQUE(A2:A13)

To compute the number of domains, enter the following formula in cell E2:

=SUM(--(LEN(UNIQUE(FILTER(B:B,(A:A=D2)*(B:B<>"na"),"")))>0))